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Abstract 

We provide a unified analysis of the predictive risk of ridge regression 
and regularized discriminant analysis in a dense random effects model. 
We work in a high-dimensional asymptotic regime where p, n —> oo and 
p/n —> 7 £ (0, oo), and allow for arbitrary covariance among the features. 
For both methods, we provide an explicit and efficiently computable ex¬ 
pression for the limiting predictive risk, which depends only on the spec¬ 
trum of the feature-covariance matrix, the signal strength, and the aspect 
ratio 7. Especially in the case of regularized discriminant analysis, we 
find that predictive accuracy has a nuanced dependence on the eigenvalue 
distribution of the covariance matrix, suggesting that analyses based on 
the operator norm of the covariance matrix may not be sharp. Our re¬ 
sults also uncover several qualitative insights about both methods: for 
example, with ridge regression, there is an exact inverse relation between 
the limiting predictive risk and the limiting estimation risk given a fixed 
signal strength. Our analysis builds on recent advances in random matrix 
theory. 


1 Introduction 

Suppose a statistician observes n training examples (27, y,) £ x y drawn 
independently from an unknown distribution V, and wants to find a rule for 
predicting y on future unlabeled draws 2; from T>. In other words, the statistician 
seeks a function h :R P -£ y, h( x) = y(c T 2;) for which E-p [£ (y, h (2;))] is small, 
where i (•, •) is a loss function; in regression y = R. and l is the squared error 
loss, while in classification y = {0, 1} and £ is the 0-1 loss. Such prediction 
problems lie at the heart over several scientific and industrial endeavors in fields 
ranging from genetics (Wray et al., 2007) and computer vision (Russakovsky 
et al., 2014) to Medicare resource allocation (Kleinberg et al., 2015). 

There are various enabling hypotheses that allow for successful prediction 
in high dimensions. These encode domain-specific knowledge and guide model 
fitting. Popular options include the “sparsity hypothesis”, i.e., that there is a 
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good predictive rule depending only on w ■ x for some sparse weight vector w 
(Candes and Tao, 2007; Hastie et al., 2015), the “manifold hypothesis” positing 
that the Xi have useful low-dinrensional geometric structure (Rifai et al., 2011; 
Simard et ah, 2000), and several variants of an “independence hypothesis” that 
rely on independence assumptions for the feature distribution (Bickel and Lev¬ 
ina, 2004; Ng and Jordan, 2001). The choice of enabling hypothesis is important 
from a practical perspective, as it helps choose which predictive method to use, 
e.g., the lasso with sparsity, neighborhood-based methods under the manifold 
hypothesis, or naive Bayes given independent features. 

There are several applications, however, where the above enabling hypothe¬ 
ses are not known to apply, and where practitioners have achieved accurate high¬ 
dimensional prediction using dense—i.e., non-sparse—ridge-regularized linear 
methods trained on highly correlated features. One striking example is the 
case of document classification with dictionary-based features of the form “how 
many times does the j-th word in the dictionary appear in the current docu¬ 
ment.” Even though p n, dense ridge-regularized methods reliably work well 
across a wide range of problem settings (Sutton and McCallum, 2006; Toutanova 
et al., 2003), and sometimes even achieve state-of-the-art performance on im¬ 
portant engineering tasks (Wang and Manning, 2012). As another example, 
in a recent bioinformatics test of prediction algorithms (Bernau et al., 2014), 
ridge regression—and a method that was previously proposed by those same 
authors—performed best, better than lasso regression and boosting. 

The goal of this paper is to gain better understanding of when dense, ridge- 
regularized linear prediction methods can be expected to work well. We focus 
on a random-effects hypothesis: we assume that the effect size of each feature is 
drawn independently at random. This can be viewed as an average-case anal¬ 
ysis over dense parameters. Our hypothesis is of course very strong; however, 
it yields a qualitatively different theory for high-dimensional prediction than 
popular approaches, and thus may motivate future conceptual developments. 

Hypothesis (random effects). Each predictor has a small, independent random 
effect on the outcome. 

This hypothesis is fruitful both conceptually and methodologically. Using 
random matrix theoretic techniques (see, e.g., Bai and Silverstein, 2010), we de¬ 
rive closed-form expressions for the limiting predictive risk of idge-regularized 
regression and discriminant analysis, allowing for the features x to have a gen¬ 
eral covariance structure E. The resulting formulas are pleasingly simple and 
depend on E through the Stieltjes transform of the limiting empirical spectral 
distribution. More prosaically, E only enters into our formulas through the 
almost-sure limits of p~ l tr((E + A/ pxp ) _1 ) and p~ l tr((E + AI pxp )" 2 ), where 
E is the sample covariance and A > 0 the ridge-regularization parameter. No¬ 
tably, the same mathematical tools can describe the two problems. 

From a practical perspective, we identify several high-dimensional regimes 
where mildly regularized discriminant analysis performs strikingly well. Thus, 
it appears that the random-effects hypothesis can at least qualitatively repro¬ 
duce the empirical successes of Bernau et al. (2014), Sutton and McCallum 
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(2006), Toutanova et al. (2003), Wang and Manning (2012), and others. We 
hope that further work motivated by generalizations of the random-effects hy¬ 
pothesis could yield a new theoretical underpinning for dense high-dimensional 
prediction. 


1.1 Overview of Results 


We begin with an informal overview of our results; in Section 1.4, we switch to 
a formal and fully rigorous presentation. In this paper we analyze the predictive 
risk of ridge-regularized regression and classification when n, p —> oo jointly. We 
work in a high-dimensional asymptotic regime where p/n converges to a limiting 
aspect ratio p/n —> 7 > 0. The spectral distribution—i.e., the cumulative 
distribution function of the eigenvalues—of the feature covariance matrix E 
converges weakly to a limiting spectral measure supported on [0, 00 ). This 
allows E to be general, and we will see several examples later. In random 
matrix theory, this framework goes back to Marchenko and Pastur (1967); see, 
e.g., Bai and Silverstein (2010). It has been used in statistics and wireless 
communications by, among others, Couillet and Debbah (2011), Serdobolskii 
(2007), Tulino and Verdu (2004), and Yao et al. (2015). 

In this paper, we present results for ridge regression (Hoerl and Kennard, 
1970) and regularized discriminant analysis (RDA) (Friedman, 1989; Serdobol¬ 
skii, 1983). Our first result derives the asymptotic predictive risk of ridge regres¬ 
sion. We observe a sample from the linear model Y = Xw+e, where each row of 
X is random with Cov [a;] = E and where e is independent centered noise with 
coordinate-wise variance 1. Given a new test example x , ridge regression then 
predicts y = w\ ■ x for wx = (X T X + n\I pxp )~ 1 X T Y; the tuning parameter 
A > 0 governs the strength of the regularization. When the regression coefficient 
w is normally distributed with identity covariance, ridge regression is a Bayes 
estimator, thus a random effects hypothesis is natural. We consider a more 
general random-effects hypothesis where E [w] = 0, and Var [u>] = p _ 1 a 2 J pX p. 
Let a 2 = E [||u>|||] be the expected signal strength, and let E be the sample 
covariance matrix of the features. Then, under assumptions detailed in Section 
2 , we show the following result: 

Theorem (informal statement). The predictive risk of ridge regression, i.e., 


Err (w\) :=E XtV ~v 


(y-w a • x ) 2 


has an almost-sure limit under high-dimensional asymptotics. This limit only 
depends on the signal strength a 2 , the aspect ratio 7 , the regularization param¬ 
eter A, and the Stieltjes transform of the limiting eigenvalue distribution o/E. 
For the optimal tuning parameter, A* = 7 /a 2 

Err( ” x ' ) ^‘*- amW (1) 

where v is the companion Stieltjes transform of the limiting eigenvalue distribu¬ 
tion of E, defined in Section l.j. 
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The required functionals of the limiting empirical eigenvalue distribution can 
be written in terms of almost-sure limits of simple quantities. For example, the 
result ( 1 ) can be written as 


Err (w A *) 


(5 hr 

(e 1 

\a z p 



+ 1 — 7 


-l 


0 . 


For general A, the limiting error rate depends on the almost sure limits of both 
P -1 tr((E + AJpxp) -1 ) and p _1 tr((E + XI pxp )~ 2 ). 

Thanks to the simple form of (1), we can use our results to gain qualitative 
insights about the behavior of ridge regression. We show that, when the signal- 
to-noise ratio is high, i.e., a>l, the accuracy of ridge regression has a sharp 
phase transition at 7 = 1 regardless of E, essentially validating a conjecture of 
Liang and Srebro (2010) on the “regimes of learning” problem. We also find that 
ridge regression obeys an inaccuracy principle , whereby there are no correlation 
structures E for which prediction and estimation of w* are both easy. For 7 = 1 
this simplifies to 


Err (w\) • E 



> a 


2 . 


this bound is tight for optimally-tuned ridge regression. We refer to Section 2.2 
for the general relation. We find the simplicity of the inverse relation remarkable. 

In the second part of the paper, we study regularized discriminant analysis 
in the two-class Gaussian problem 


y ~ {±1} with P [y = 1] = 1/2, and x ~ N (p y , E), (2) 


where fi±\ and E are unknown. While our results cover unequal class probabili¬ 
ties, for simplicity here we present the case when P [y = 1 ] = 1/2. We instantiate 
the random-effects hypothesis by assuming that the pairs (p~i ,i, p+ 17 ) are in¬ 
dependently and identically distributed for i = 1, ..., p, and write a 2 = E [||£|||] 
with S = (/r_|_i — /i_i)/ 2 . We again work in a high-dimensional regime where 
p/n 7 > 0 and the within-class covariance E has a limiting spectral distri¬ 
bution. Given this notation, the Bayes-optimal decision boundary is orthogo¬ 
nal to w* = E _ 1 <5; meanwhile, the regularized discriminant classifier predicts 
y = sign ( w\ ■ x ), where the form of the weight-vector w\ is given in Section 3. 

Theorem (informal statement). In high dimensions, and in the metric induced 
by E, the angle between w* and w\, i.e., 


cose (w\, w*) 


w*)g 

\/(w\, w a ) e (w*, w*)x 


(u, v) E 


u t T,v, 


has an almost-sure limit. The classification error of regularized discriminant 
analysis converges to an almost-sure limit that depends only on this limiting 
angle, as well as the limiting Bayes error. The limiting risk can be expressed in 
terms of a, 7 , A, as well the Stieltjes transform of the limit eigenvalue distribu¬ 
tion of the empirical within-class covariance matrix. 
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We can again use our result to derive qualitative insights about the behavior 
of RDA. We find that the limiting angle between w\ and w* converges to a 
non-trivial quantity as a 2 -> oo, implying that our analysis is helpful in un¬ 
derstanding the asymptotics of RDA even in a very high signal-to-noise regime. 
Finally, by studying the limits A —> 0 and A —► oo, we can recover known high- 
dinrensional asymptotic results about Fisher’s linear discriminant analysis and 
naive Bayes methods going back to Bickel and Levina (2004), Raudys (1967), 
Saranadasa (1993), and even early work by Kolmogorov. 

Mathematically, our results build on recent advances in random matrix the¬ 
ory. The main difficulty here is finding explicit limits of certain trace func¬ 
tionals involving both the sample and the population covariance matrix. For 
instance, the Stieltjes transform m of the empirical spectral distribution satis¬ 
fies 77i(—A) = linip^oo p^ 1 tr((E + XIpxp)^ 1 )- However, standard random ma¬ 
trix theory does not provide simple expressions for the limits of functionals like 
p -1 tr(E(E + A/pxp)” 1 ) or p~ l tr([E(E + \I pxp )]~ 2 ) that involve both E and 
E. For this we leverage and build on recent results, including the work of Chen 
et al. (2011), Hachem et al. (2007), and Ledoit and Peche (2011). Our contri¬ 
butions include some new explicit formulas, for which we refer to the proofs. 
These formulas may prove useful for the analysis of other statistical methods un¬ 
der high-dimensional asymptotics, such as principal component regression and 
kernel regression. 

1.2 A First Example 

A key contribution of our theory is a precise understanding of the effect of cor¬ 
relations between the features on regularized discriminant analysis. Correlated 
features have a non-trivial effect, and cannot be summarized using standard 
notions such as the condition number of E or the classification margin. The full 
eigenvalue spectrum of E matters. This is in contrast with popular analyses of 
high-dimensional classification methods, in which the bounds often depend on 
the operator norm ||E|| op (see for instance the review Fan et al. (2011)), thus 
suggesting that existing analyses of many classification methods are not sharp. 

Consider the following examples: First, E has eigenvalues corresponding to 
evenly-spaced quantiles of the standard Exponential distribution; Second, E 
has a deptli-d BinaryTree covariance structure that has been used in popula¬ 
tion genetics to model the correlations between populations whose evolutionary 
history is described by a balanced binary tree (Pickrell and Pritchard, 2012). In 
both cases, we set the class means n y such as to keep the Bayes error constant 
across experiments. Figure 1 plots our formulas for the asymptotic error rate 
along with empirical realizations of the classification error. 

Both covariance structures are far from the identity, and have similar con¬ 
dition numbers. However, the Exponential problem is vastly more difficult for 
RDA than the BinaryTree problem. This example shows that classical notions 
like the classification margin or the condition number of E cannot satisfactorily 
explain the high-dimensional predictive performance of RDA; meanwhile, our 
asymptotic formulas are accurate even in moderate sample sizes. Our compu- 
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tational results are reproducible and open-source software to do so is available 
from https://github.com/dobriban/high-dim-risk-experiments/. 

1.3 Related Work 

Random matrix theoretic approaches have been used to study regression and 
classification in high-dimensional statistics (Serdobolskii, 2007; Yao et al., 2015), 
as well as wireless communications (Couillet and Debbah, 2011; Tulino and 
Verdu, 2004). Various regression and M-estimation problems have been studied 
in high dimensions using approximate message passing (Bayati and Montanari, 
2012; Donoho and Montanari, 2015) as well as methods inspired by random 
matrix theory (Bean et al., 2013). We also note a remarkably early random 
matrix theoretic analysis of regularized discriminant analysis by Serdobolskii 
(1983). In the wireless communications literature, the estimation properties of 
ridge regression are well understood; however its prediction error has not been 
studied. 

El Karoui and Rosters (2011) study the geometric sensitivity of random 
matrix results, and discuss the consequences to ridge regression and regularized 
discriminant analysis, under weak theoretical assumptions. In contrast, we make 
stronger assumptions that enable explicit formulas for the limiting risk of both 
methods, and allow us to uncover several qualitative phenomena. Our use of 
Ledoit and Peche (2011)’s results simplifies the proof. 

We review the literature focusing on ridge regression or RDA specifically in 
Sections 2.3 and 3.5 respectively. Important references include, among others, 
Bickel and Levina (2004), Dicker (2014), El Karoui (2013), Fujikoshi et al. 
(2011), Hsu et al. (2014), Saranadasa (1993), and Zollanvari and Dougherty 
(2015). 

1.4 Basics and Notation 

We begin the formal presentation by reviewing some key concepts and notation 
from random matrix theory that are used in our high-dimensional asymptotic 
analysis. Random matrix theory lets us describe the asymptotics of the eigen¬ 
values of large matrices (see e.g., Bai and Silverstein, 2010). These results are 
typically stated in terms of the spectral distribution , which for a symmetric 
matrix A is the cumulative distribution function of its eigenvalues: Fa{x) = 
p -1 < x). In particular, the well-known Marclienko-Pastur theo¬ 

rem, given below, characterizes the spectral distribution of covariance matrices. 
We will assume the following high-dimensional asymptotic model: 

Assumption A. (high-dimensional asymptotics) The following conditions hold. 

1. The data X £ R nxp is generated as X = Z E 1 / 2 for an nxp matrix Z with 
i.i.d. entries satisfying E [Z^\ = 0 and Var [Zij] = 1, and a deterministic 
p x p positive semidefinite covariance matrix E. 

2. The sample size n —> oo while the dimensionality p —> oo as well, such 
that the aspect ratio p/n —> 7 > 0 . 
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Figure 1: Classification error of RDA in the BinaryTree and Exponential 
models. The theoretical formula (red, dashed) is overlaid with the results from 
simulations (blue, solid); we also display the oracle error (yellow, dotted). The 
class means are drawn from fi±i ~ Af (0, a 2 p~ 1 I p xp) > where a is calibrated such 
that the oracle classifier always has an error rate of 1%. For BinaryTree, we 
train onn = r )~ 1 p samples, where p = 1024; for Exponential, we use n = 500 
samples. We test the trained model on 10,000 new samples, and report the 
average classification error. Our asymptotically-motivated theoretical formulas 
appear to be accurate here, even though we only have a moderate problem 
size. The parameter A, defined in Section 3, quantifies the strength of the 
regularization. 
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3. The spectral distribution Fj ] of E converges to a limit probability distribu¬ 
tion H supported [0, oo), called the population spectral distribution (PSD). 

Families of covariance matrices E that fit the setting of this theorem include 
the identity covariance, BinaryTree, Exponential and the autoregressive AR-1 
model with E,; ? = (see Grenander and Szego, 1984, for the last one). 

Theorem (Marchenko and Pastur (1967); Silverstein (1995)). Under assump¬ 
tion A, the spectral distribution Fg of the sample covariance matrix E also con¬ 
verges weakly, with probability 1, to a limiting distribution supported on [0, oo). 

The limiting distrbution F is called the empirical spectral distribution (ESD), 
and is determined uniquely by a fixed point equation for its Stieltjes transform , 
which is defined for any distribution G supported on [0, oo) as 

^.rm, , SC \.t 

Jl =o l-z 

Given this notation, the Stieltjes transform of the spectral measure of E satisfies 

mg (z) = - tr ^(e — z Ipxp'j ^ converges to m (z) (3) 

both almost surely and in expectation, for any z € C \ R + ; here, we wrote 
m(z) := rriF (z). We also define the companion Stieltjes transform v(z), which 
is the Stieltjes transform of the limiting spectral distribution of the matrix 
E = n~ 1 XX T . This is related to m(z) by 

j(m(z) + 1/z) = v(z) + 1/z for all zi EC\t + . (4) 

In addition, we write the derivatives as 1 m!(z) = J™ 0 dG(l)/ (l — z) 2 and 
v'(z) = y(m'(z) — z~ 2 ) + z~ 2 . These derivatives can also be understood in 
terms of empirical observables, through the relation 

— tr ^E + A Ipxp'j ^ —ta.s. m!(— A). 

Finally, our analysis also relies on several more recent formulas for limits of 
trace functionals involving both E and E. In particular, we use a formula 
due to Ledoit and Peche (2011), who in the analysis of eigenvectors of sample 
covariance matrices showed that, under certain moment conditions: 2 

p tr (e + A^pxp) 'j -t a .s. ^ ( xv(-A) ~ 1 ) as n > P °°; ( 5 ) 

x We will denote by v'(— A) the derivative of the Stieltjes transform, v'(z), evaluated at 
z — — A; and not the derivative of the function A —>■ v{— A). 

2 See the supplement for more details about this result. 





2 Predictive Risk of Ridge Regression 

In the first part of the paper, we study the predictive behavior of ridge regression 
under large-dimensional asymptotics. Suppose that we have data drawn from 
a p-dimensional random-design linear model with n independent observations 
Hi = Xi ■ w + £{. The noise terms are independent, centered, with variance 
one, and are independent of the other random quantities. The x,; are arranged 
as the rows of the n x p matrix X, and yi are the entries of the n x 1 vector Y. 
We estimate w by ridge regression: w\ = (X T X + A nI. pxp )~ 1 X T Y, for some 
A > 0. We make the following random weights assumption, where a 2 = IE[||xt;|| 2 ] 
is the expected signal strength. 

Assumption B. (random regression coefficients) The true weight vector w is 
random with E [in] = 0, and Var [in] = p _1 a 2 / pxp . 

Our result about the predictive risk of ridge regression is stated in terms 
of the expected predictive risk r\(X) = E [(y — y\) 2 | X ], where (x, y) is taken 
to be an independent test example from the same distribution as the training 
data, and y\ = w\ ■ x. 

Theorem 2.1. Under Assumptions A and B, suppose moreover that the eigen¬ 
values of £ are uniformly bounded above •• 3 lollop < C, for all p. Also assume 
E [Z}f] < C for all p. Then, 

1. Writing 7 p = p/n and A* = "f p a~ 2 , the finite sample predictive risk r\*(X) 
converges almost surely 

rx;{X) = 1 + lE t r (E (£+ ^Jr* P y l ) (6) 

^ a . s . R* (if, a 2 ,7) := 


where A* = 7a 


-2 


2. Moreover, for any A > 0 , the predictive risk converges almost surely to the 
limiting predictive risk R\(H,a 2 ,7), where 


R\(H, a 2 ,7) = 


1 


Au(—A) 


1 + 


Aa 2 

7 


- 11 - 


An'(-A) 

n(-A) 


The choice A* minimizes R\(H, a 2 , 7). 

Proof outline. The proof of part 1 is sufficiently simple to outline here; see the 


3 Below, C will denote an arbitrary fixed constant whose meaning can change from line to 
line. 
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supplement for part 2. We begin by verifying formula (6): 


(x-(w x; ~w^ \X 


r A* (X) = 1 + E 
= 1 + E 

= 1 + a ; 2 a 2 n tr (s (X T X + a; n I pxp ) " 2 ) 


\ T / \ 


) ;~wj 

X 


+ tr(E(X T X + A;n/ pxp ) 1 X T X (X T X + \* p n I pxp ) 


= 1 + ^ tr 




On the last line we used the choice of A*. From the results of Ledoit and Peche 
(2011), and from —► 7, it can be verified that p~ l tr(E(E + 'y P a~ 2 Ip Xp )~ 1 ) 
converges almost surely to limit in (5), finishing part 1. □ 

This result fully characterizes the first order behavior of the predictive risk of 
ridge regression under high-dimensional asymptotics. To verify its finite-sample 
accuracy, we perform a simulation with the BinaryTree and Exponential mod¬ 
els. We compute the limit risks using the algorithms in the supplement. The 
results in Figure 2 show that the formulas given in Theorem 2.1 appear to be 
accurate, even in small sized problems. In Figure 2, for BinaryTree we train on 
n = 7 -1 p samples, where p = 2 4 ; for Exponential on n = 20. We set the signal 
strength to a 2 = 1 and generated w, X, and e as Gaussian random variables 
with i.i.d. entries and the desired variance. The results are averaged over 500 
simulation runs; we evaluated the empirical prediction error using a test set of 
size 100. 

Intriguingly, Figure 2 shows that the prediction performance of ridge regres¬ 
sion is very similar on the two problems. This presents a marked contrast to the 
RDA example given in the introduction, where the two covariance structures 
led to very different classification performance. Thus, it appears that the loss 
function and the spectrum of E can interact non-trivially. 

Meanwhile, in the special case of identity covariance, the quantity R\ — 1 
coincides with the normalized estimation error, so we recover known results 
described in, e.g., Tulino and Verdii (2004). When E = I pxp , we have an explicit 
expression for the Stieltjes transform (e.g., Bai and Silverstein, 2010, p. 52), 
valid for A > 0: 


mi{- A; 7) 


-(1 — 7 + A) + a/(1 — 7 + A) 2 + 47 A 
27A 


(7) 


Theorem 2.1 implies that the limit predictive risk of ridge regression for general 
A equals 


R\ (« 2 ,7) = 1 + 7 7) + A (Aa 2 - 7) m'j(-A; 7), 
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BinaryTree Exponential 








Figure 2: Prediction error of ridge regression in the BinaryTree and 
Exponential model. The theoretical formula (red, dashed) is overlaid with 
the results from simulations (blue, solid). The signals are drawn from 
w ~ J\T (0,p^ 1 Ip X p) ■ For BinaryTree, we train on n = 'y~ 1 p samples, where 
p = 2 4 ; for Exponential on n = 20. We take 100 instances of random training 
data sets, and for each we test on 500 samples. We report the average test error 
over all 50,000 test cases. 
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which has an explicit form. Furthermore, the optimal risk has a particularly 
simple form: 


R*(a 2 , 7 ) 


1 

2 


.. 7- 1 2 
1 H-a 


/ 


i - 


—-+ 4a 2 


See the supplement for details on these derivations. 
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2.1 Regimes of Learning 

As an application of Theorem 2.1, we study how the difficulty of ridge re¬ 
gression depends on the signal strength a 2 . Liang and Srebro (2010) call this 
the regimes of learning problem and argue that, for small a 2 the complexity 
of ridge regression should be tightly characterized by dimension-independent 
Rademacher bounds, while for large a 2 the error rate should only depend on 
7 . Liang and Srebro (2010) justify their claims using generalization bounds for 
the identity-covariance case E = I pxp , and conjecture that similar relationships 
should hold in general. Using our results, we can give a precise characterization 
of the regimes of learning of optimally-regularized ridge regression with general 
covariance E. 

From Theorem 2.1, we know that given a signal strength a 2 , the asymptot¬ 
ically optimal choice for A is A* (a) = 7 /a 2 , in which case the predictive risk of 
ridge regression converges to 


R*(H,a 2 , 7 ) 


1 

A*i>(—A*) ’ 


We now use this formula to examine the two limiting behaviors of the risk, 
for weak and strong signals. The results below are proved in the supplement, 
assuming that the population spectral distribution H is supported on a set 
bounded away from 0 and infinity. 

The weak-signal limit is relatively simple. First, lim Q 2_ ) , 0 R*(H, a 2 , 7 ) = 1, 
reflecting that for a small signal, we predict a near-zero outcome due to a large 
regularization. Second, lim Q 2 _ > 0 (l?*(lL,a 2 , 7 ) — l)/a 2 = E#T, where E jjT is 
the large-sample limit of the normalized traces p - 1 trE. Therefore, for small 
a, the difficulty of the prediction is determined to first order by the average 
eigenvalue, or equivalently by the average variance of the features, and does not 
depend on the size of the ratio 7 = p/n. 

Conversely, the strong-signal limiting behavior of the risk depends the aspect 
ratio 7 , and experiences a phase transition at 7 = 1. When 7 < 1, the predictive 
risk converges to 

Jim R*{H,a 2 , 7 ) = —*— 

a 2 —>-oo 1 — 'y 

regardless of E. This quantity is known to be the re, p —> 00 , p/n —> 7 limit 
of the risk of ordinary least squares (OLS). Thus when p < n and we have a 


12 









very strong signal, ridge regression cannot outperform OLS, although of course 
it can do much better with a small a. 

When 7 > 1, the risk R*(H,a 2 , 7 ) can grow unboundedly large with a. 
Moreover, we can verify that 

Jim a~ 2 R* (H, a 2 , j) = 7 —r > 0 . (9) 

a 2 ->oo 7U(UJ 


Thus, the limiting error rate depends on the covariance matrix through c(0). In 
general there is no closed-form expression for r;( 0 ), which is instead characterized 
as the unique c > 0 for which 


1 

7 



tc 

1 + tc 


dH{t). 


In the special case £ = I pxp , however, the limiting expression simplifies to 
1/( 7 ^( 0 )) = (7 — l)/ 7 - In other words, when p > n, optimally tuned ridge 
regression can capture a constant fraction of the signal, and its test-set fraction 
of explained variance tends to 7 -1 . 

Finally, in the threshold case 7 = 1 , the risk R*(H : a 2 , 7 ) scales with a: 


lim a 1 R*(H 1 a 2 , 7 ) 

OL 1 —>00 


1 


( 10 ) 


where E h [T 1 ] is the large-sample limit of p 1 tr (£ - 1 ). Thus, the absolute 
risk R* diverges to infinity, but the normalized error a~ 2 R*(H 1 a 2 : 7) goes to 
0. This appears to be a rather unusual risk profile. In the case £ = I pxp , 
our expression simplifies further and we get the finite-a formula R* (a 2 , l) = 
( \/4a 2 + 1 + l)/2, which scales like a. 

In summary, we find that for general covariance £, the strong-signal risk 
R*(a 2 , 7) scales as 0(1) if 7 < 1, as 0(a) if 7 = 1, and as 0(a 2 ) if 7 > 1. We 
illustrate this phenomenon in Figure 3, in the case of the identity covariance 
£ = Ipxp- We see that when 7 < 1 the error rate stabilizes, whereas when 
7 > 1, the error rate eventually gets a slope of 1 on the log-log scale. Finally, 
when 7 = 1 , the error rate has a log-log slope of 1 / 2 . 

Thus, thanks to Theorem 2.1, we can derive a complete and exact answer 
the regimes of learning question posed by Liang and Srebro (2010) in the case 
of ridge regression. The results (9) and (10) not only show that the scalings 
found by Liang and Srebro (2010) with £ = I pxp hold for arbitrary £, but make 
explicit how the slopes depend on the limiting population spectral distribution. 
The ease with which we were able to read off this scaling from Theorem 2.1 
attests to the power of the random matrix approach. 


2.2 An Inaccuracy Principle for Ridge Regression 

Our results also reveal an intriguing inverse relationship between the predic¬ 
tion and estimation errors of ridge regression. Specifically, denoting the mean- 
squared estimation error as Re,u{ A) = E [||w)a — u>*|| 2 ], it is well known that 
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Figure 3: Phase transition for predictive risk of ridge regression with identity 
covariance S = I pxp . Error rates are plotted for 7 = 0.25, 0.5, 0.8, 0.9, 1, 1.1, 
1.3, 2, 4, and 8 . 


optimally tuned ridge regression satisfies, under the conditions of Theorem 2.1, 
Re,u{ A*) ^ a.s. Re ■= jm (-A*) for A* = 7 a -2 , 


where m is the Stieltjes transform of the limiting empirical spectral distribution 
(see, e.g., Tulino and Verdu, 2004, Chapter 3). Combining this result with 
Theorem 2.1 and (4), we find the following relationship between the limiting 
predictive risk Rp and the limiting estimation risk Re- 

Corollary 2.2. Under the conditions of Theorem 2.1, the asymptotic predictive 
and estimation risks of optimally-tuned ridge regression are inversely related, 


1 - 


1 

Rp 


= 7 



The equation holds for all limit eigenvalue distributions H of the covari¬ 
ance matrices E. Both sides of the above equation are non-negative: Rp 
cannot fall below the intrinsic noise level Var [Y | X] = 1, while Re < 
limsupp^^ Re,u (A*) < limsup^^ i?._E,n(0) = a 2 . When 7 = 1, we get the 
even simpler equation 

Re Rp = Q: 2 . 


14 










The product of the estimation and prediction risks equals the signal strength. 
Since this holds for the optimal A*, it also implies that for any A we have the 
lower bound Re( A) • Rp{ A) > a 2 ; we find the explicit formula relating the two 
risks remarkable. The inverse relationship may be somewhat surprising, but it 
has an intuitive explanation. 4 When the features are highly correlated and v 
is correspondingly large, prediction is easy because y lies close to the “small” 
column space of the feature matrix X , but estimation of w is hard due to multi- 
collinearity. As correlation decreases, prediction gets harder but estimation gets 
easier. 

2.3 Related Work for High-Dimensional Ridge Regression 

Random-design ridge regression in high dimensions is a thoroughly studied topic. 
In particular, El Karoui (2013) and Dicker (2014) study ridge regression with 
identity covariance E = I pxp in an asymptotic framework similar to ours; this 
special case is considerably more restrictive than a general covariance. The 
study of the estimation error E [ ||t&^ — w\\ 2 ] of ridge regression has received 
substantial attention in the wireless communication literature; see, e.g., Couil- 
let and Debbah (2011) and Tulino and Verdii (2004) for references. To our 
knowledge, however, that literature has not addressed the behavior of predic¬ 
tion error. Finally, we also note the work of Hsu et al. (2014), who provide 
finite-sample concentration inequalities on the out-of-sample prediction error of 
random-design ridge regression, without obtaining limiting formulas. In con¬ 
trast to these results, we give explicit limiting formulas for the prediction error. 


3 Regularized Discriminant Analysis 

In the second part of the paper, we return to regularized discriminant analysis 
and the two-class Gaussian discrimination problem (2). For simplicity we will 
first discuss the case when the population label proportions are balanced. In 
this case, the Bayes oracle predicts using (Anderson, 2003) 

y(x) = sign I 6 E la;--- 11 with 6 =---, 

and has an error rate Erreayes = ‘L (—A„ jP ), where A UtP = V<5 T S —1 <5 is half the 
between-class Mahalanobis distance. The Gaussian classification problem has 
a rich history, going back to Fisher’s pioneering work on linear discriminant 
analysis (LDA). When we have the same number of examples from both the 
positive and negative classes, i.e., n_i = n + 1 = n/ 2, LDA classifies using the 

4 A similar heuristic was given by Liang and Srebro (2010), without theoretical justification. 


15 






linear rule 


V = sign (Vs c 1 ^x - ^ 1 ^ +1 ^ , where 

~ n 

? M+i — M-i ^ 1 t - \® 2 a ' 

o = -^-, £ c = ( x i ~ Vyi) > and Mil 

~ i—1 

Here E c is the centered covariance matrix. In the low-dimensional case where 
n gets large while p remains fixed, LDA is the natural plug-in rule for Gaus¬ 
sian classification and efficiently converges to the Bayes discrimination function 
(Anderson, 2003; Efron, 1975). When p is on the same order as n, however, the 
matrix inverse Ej 1 is unstable and the performance of LDA declines, as dis¬ 
cussed among others by Bickel and Levina (2004). Instead, we will study regu¬ 
larized discriminant analysis, the linear rule y = h (x — (p-i + p + 1 )/ 2 ) where 
h w (x) = sign(u> • x) and the weight vector is 5 u>\ = (E c + \I pX p)~ 1 5 (Friedman, 
1989; Serdobolskii, 1983). 


Y x i 

{i:yi=±l} 


3.1 High-Dimensional Asymptotics 

Throughout this section, we make a random-effects assumption about the class 
means, where the expected signal strength is E [||d|||] = a 2 . We denote the 
classification error of RDA as Err (w\) = P [y ^ signjuA • (x — A)}] - The prob¬ 
ability is with respect to a independent test data point (x, y) from the same 
distribution as the training data. 

Assumption C. (random weights in classification) The following conditions 
hold: 


1 . /i_i and p +1 are randomly generated as /x_i = p — 6 and p+\ = p + S, 
where 5 has i.i.d. coordinates with 


E [<y = 0, Var [<!>,] 



and E 



C 

p2-t-?7/2 


for some fixed constants rj > 0 and C. 

2 . p = (p -1 + p. j_i )/2 is either fixed, or random and independent of 5 , X and 
y, and satisfies limsupp^^ UmIII/p 1 ^ 2 ' 1 ’ < C almost surely for some fixed 
constants ( > 0 and C. 

Theorem 3 . 1 . Under parts 2 and 3 of Assumption A, and Assumption C, 
suppose moreover that that the eigenvalues of E are uniformly bounded as 0 < 
b < A m ; n (E) < A max (E) < B for some fixed constants b and B. Finally, 

5 The notation was chosen to emphasize the similarities between ridge regression and RDA. 
There will be no possibility for confusion with the ridge regression weight vector, also denoted 
w x . 
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suppose that we have a balanced population P [y = 1] = 1/2 and equal class sizes 
n_ i = n_|_i. Then, the classification error of RDA converges almost surely 


Err ( w\) 


$ (—0(A)), where 0(A) = 


(A) 


\/a 2 t? (A) + £ (A) 

and t, r), and f are determined by the problem parameters H and 7: 


T (A) = Xmv, 77 (A) = ———, £(A) = ^-1. 

7 ir 


.Here, m = m(—A) is f/ie Stieltjes transform of the limit empirical spectral distri¬ 
bution F of the covariance matrix T, c , and v = u(—A) is the companion Stieltjes 
transform defined in (4). 

The proof of Theorem 3.1 (in the supplement) is similar to Theorem 2.1, but 
more involved. The main difficulty is to evaluate explicitly the limits of certain 
functionals of the population and sample covariance matrices. We extend the 
result of Ledoit and Peche (2011), and rely on additional results and ideas from 
Hachem et al. (2007) and Chen et al. (2011). In particular, we use a derivative 
trick for Stieltjes transforms, similar to that employed in a related context by 
El Karoui and Rosters (2011), Rubio et al. (2012), and Zhang et al. (2013). The 
limits are then combined with results on concentration of quadratic forms. 

The above result can also be extended to RDA with uneven sampling propor¬ 
tions. Since the limit error rates get more verbose, this is the only place where 
we discuss uneven sampling. Suppose that the conditions of Theorem 3.1 hold, 
except now we have unequal underlying class probabilities P(yi = +1) = 7r + , 
P (gji = —1) = 7r_, and our training set is comprised of n± 1 samples with 
label yi = ±1 such that p/n± 1 —► 7±i > 0. We do not assume that 
n_i/(n_i + n+i) 7r_. Consider a general regularized classifier sign(/A(;r)), 
where f\(x) = [x - (fi+i + /l_i)/2] T (E c + A/ pX p) _1 [/i+i - fi-i] + c for some 
c € K, where £ c , fi±\ are defined in the usual way. We prove in the supplement 
that 


Theorem 3.2. Under the conditions of Theorem 3.1, and with unequal sam¬ 
pling, the classification error of RDA converges almost surely: 

F (sign(/A(a;) y)) ~^ a . s . ?r_$ (-0 _) + tt + 3> (- 0+), (11) 


where the effective classification margins have the form 


±aM-A) + ^i(i-l) + c j 

0± = T-ryy— -j and 


Q = a 


VQ 

1 v — Ai/ 7-1 + 7+1 v' — v 2 


7 (Ac) 


A 2 u 4 


In this section we assumed Gaussianity, but by a Lindeberg-type argument 
it should be possible to extend the result to non-Gaussian observations with 
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matching moments. It should also be interesting to study how sensitive the 
results are to the particular assumptions of our model, such as independence 
across samples. 

It is worth mentioning that the regression and classification problems are 
very different statistically. In the random effects linear model, ridge regression 
is a linear Bayes estimator, thus the ridge regularization E + A I pX p of the co- 
variance matrix is justified statistically. However, for classification, the ridge 
regularization is merely a heuristic to help with the ill-conditioned sample co- 
variance. It is thus interesting to know how much this heuristic helps improve 
upon un-regularized LDA, and how close we get to the Bayes error. We now turn 
to this problem, which can be studied equivalently from a geometric perspective. 


3.2 The Geometry of RDA 

The asymptotics of RDA can be understood in terms of a simple picture. The 
angle between the Bayes decision boundary hyperplane and the RDA discrimi¬ 
nating hyperplane tends to an asymptotically deterministic value in the metric 
of the covariance matrix, and the limiting risk of RDA can be described in terms 
of this angle. 

Recall that, in the balanced 7r + = 7r_ and n + = n_ case, the estimated 
RDA weight vector is w\ = (E c + A/ pxp ) _1 <5, while the Bayes weight vector is 
w* = E _1 <5. Writing (a,b)s = a T Y,b for the inner product in the E-metric, the 
cosine of the angle—in the same metric- between the two is 


cos s (w, W\) 


(■u>a, E 1 <5) s 

IM| S IIE-MHs 


yj wj E w\\/ (5 T E -1 <5 


As discussed earlier, the Bayes error rate for the two-class Gaussian problem 
is Err Bay es = $(-A „ iP ), with A„ iP = v^E” 1 ^ -> a . s . A = a^/En [T -1 ]. 6 
Therefore, from Theorem 3.1 it follows that the cosine of the angle coss(w,w\) 
has a deterministic limit, denoted by T(H, 7, a 2 , A), given by 

T (Hj'y, a 2 , A) =0 (.ff, 7, a 2 , A) /A e [0, 1], 

where © (ff, 7, a 2 , A) is the classification margin of RDA with the dependence on 
each parameter made explicit. The limit of the cosine quantifies the inefficiency 
of the RDA estimator relative to the Bayes one. 

We gain some insight into this angle for two special cases: when H = 5 1, and 
by taking the limit a 2 —> 00 . First, with H = 6 1, or equivalently E = / pxp , we 
curiously find that the effects of the signal strength a 2 and regularization rate 
A decouple completely, as shown in Corollary 3.3 below. See the supplement for 
a proof of the following result. 

6 To show this, we observe that the quantity <5 T E —1 <5 converges almost surely to 
q 2 E/-/ [T~ 1 j under the conditions of Theorem 3.1. This follows by a concentration of quadratic 
forms argument stated in the supplement, and because the spectrum of E converges to ff, so 
in particular E [<5 t E _1 i 5] = a 2 p _1 trS _1 —»• ct 2 E h 
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Corollary 3.3. Under the conditions of Theorem 3.1, let £ = I pxp for all p. 
Then, the limiting cosine T of the angle between the Bayes and RDA hyperplanes 
is 


r(^i, 7 , a 2 , A) = 


a 


'1 + 7 Atoj(—A; 7 ) 


Va 2 + 7 V 1 + 7 TO /(—A;7) 


where the Stieltjes transform mi (—A; 7 ) /or£ = I pxp is given in (7). For 7 = 1, 
this expression simplifies further to 


r(<5i, 1, a 2 , A) 


a 2 [A (A+ 4 )] 1/4 
Va 2 + 1 A 1 / 2 + (A + 4) 1/2 " 


Examining T(i5i, 7 , a 2 , A), we can attribute the suboptimality to two sources 
of noise: We need to pay a price a/\/a 2 + 7 for estimating p±i, while the cost 
of estimating £ is ([1 + yAwf(—A; 7 )]/[1 + 7 W/(—A; 7 )]) 1 / 2 . If we knew that 
£ = I, we could send A —> 00 . It is easy to verify that this would eliminate the 
second term, leading to a loss of efficiency a/^a 2 + 7 . 

In the case of a general covariance matrix £ we get a similar asymptotic 
decoupling in the strong-signal limit a 2 —> 00 . The following claim follows 
immediately from Theorem 3.1. 


Corollary 3.4. Under the conditions of Theorem 3.1, the cosine of the angle 
between the optimal and learned hyperplanes has the limit as a 2 —> 00 : 


lim T(H, 7 , a 2 , A) 

O '—^OO 


A) E H [T-if 


Thus, RDA is in general inconsistent for the Bayes hyperplane in the case of 
strong signals. Corollary 3.4 also implies that, in the limit a —> 00 , the optimal 
A for RDA converges to a non-trivial limit that only depends on the spectral 
distribution H. No such result is true for ridge regression, where A* = a _2 7 —► 0 
as a -> 00 , regardless of £. This contrast arises because ridge regression is a 
linear Bayes estimator, while RDA is a heuristic which is strictly suboptimal to 
the Bayes classifier. 

We illustrate the behavior of the cosine T for the AR-1(0.9) model in Figure 
4, which displays T for values of a ranging from a = 0.1 to a = 2. We see 
that the T-curve converges to its large-a limit fairly rapidly. Moreover, some¬ 
what strikingly, we see that the optimal regularization parameter A*, i.e., the 
maximizer of T, increases with the signal strength a 2 . 

Finally, we note that Efron (1975) studies the angle T in detail for low¬ 
dimensional asymptotics where p is fixed while n —^ 00 ; in this case, T converges 
in probability to 1 , and the sampling distribution of n(l — T) converges to 
a (scaled) X P -i distribution. Establishing the sampling distribution in high 
dimensions is interesting future work. 
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Figure 4: The cosine T(H, 7 , a 2 , A) for the AR-1(0.9) model, with a £ [0.1,2]. 
The values of a used for each curve are evenly spaced, with a gap of 0.05 between 
each curve. The cosine quickly converges to a limit as a increases. 


3.3 Do existing theories explain the behavior of RDA? 

Theorems 3.1 and 3.2 give precise information about the error rate of RDA. It 
is of interest to compare this to classical theories, such as Vapnik-Cliervonenkis 
theory or Rademacher bounds, to if they explain the qualitative behavior of 
RDA. In this section, we study a simple simulation example, and conclude that 
existing theory does not precisely explain the behavior of RDA. 

We consider a setup with n = p = 500, S an auto-regressive (AR-1) matrix 
such that Yiij = and p± 1 ~ A/”(0, a 2 p~ 1 I pxp ). This is a natural model 

when the features can be ordered such that correlations decay with distance; 
for instance in time series and genetic data. We run experiments for different 
values of p in two different settings: o nce with constant effect size a 2 = 1 , 
and once with constant oracle margin ^/E[A^ p ] = 2.3. Given a > 0 and p £ 
[ 0 , 1 ) one can verify using the description of the limit population spectrum 
(Grenander and Szego, 1984), and by elementary calculations, that the limiting 
oracle classification margin in the AR-1 model is A = ayj (1 + p 2 ) / (1 — p 2 ); 
thus, with constant a 2 the oracle classifier improves as p increases. 

Existing results give us some intuition about what to expect. Since n = p, 
classical heuristics based on the theory of Vapnik and Chervonenkis (1971) as 
well as more specialized analyses (Saranadasa, 1993; Bickel and Levina, 2004) 
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Figure 5: Classification error of RDA in an AR-1 model. The theoretical formula 
(red, dashed) is overlaid with the results from simulations (blue, solid; we also 
display the oracle error (yellow, dotted). In the first column, we keep the signal 
strength fixed at a 2 = 1, whereas in the second column we picked a such as 
to fix the oracle error at Erruayes = 0.01. We test on 10,000 new samples, and 
report the average classification error. 
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predict that unregularized LDA will not work. As we will see, this matches 
our simulation results. Meanwhile, Bickel and Levina (2004) study worst-case 
performance of the independence rule relative to the Bayes rule. In our setting, it 
can be verified that their results imply 0j R > (1 — p 2 )/(l+p 2 ) A, where the error 
rate of the independence rule is $(—0 IR ). This predicts that independence rules 
will work better for small correlation p, which again will match the simulations. 

The existing theory, however, is much less helpful for understanding the 
behavior of RDA for intermediate values of A. A learning theoretic analysis 
based on Rademacher complexity suggests that the generalization performance 
of RDA should depend on terms that scale like ^/\\w\\\\tYYi/n x yjX~ 2 p/n for 
large values of A (e.g., Bartlett and Mendelson, 2003). In other words, based on 
a classical approach, we might expect that mildly regularized RDA should not 
work, but using a large A may help. Rademacher theory is not tight enough to 
predict what will happen for A « 1. 

Given this background, Figure 5 displays the performance of RDA for differ¬ 
ent values of p, along with our theoretically derived error from Theorem 3.1. In 
the a 2 = 1 case, we find that—as predicted—unregularized LDA does poorly. 
However, when p is large, mildly regularized RDA does quite well. 

Strikingly, RDA is able to benefit from the growth of the oracle classification 
margin with p, but only if we use a small positive value of A. The analyses based 
on unregularized LDA or “infinitely regularized” independence rules do not cover 
this case. Moreover, this phenomenon is not predicted by Rademacher theory, 
which requires A 1 to improve o ver basi c Vapnik-Chervonenkis bounds. Re¬ 
sults from constant margin case ^EfA^ p ] = 2.3 reinforce the same interpreta¬ 
tions. Finally, our formulas for the error rate are accurate despite the moderate 
sample size n = p = 500. 

3.4 Linear Discriminant Analysis vs. Independence Rules 

Two points along the RDA risk curve that allow for particularly simple analytic 
expressions occur as A —> 0 and A —> oo: the former is just classical linear 
discriminant analysis while the latter is equivalent to an independence rule (or 
“naive Bayes”). In this section, we show how taking these limits we can recover 
known results about the high-dimensional asymptotics of linear discriminant 
analysis and naive Bayes. Further, we compare these two methods over certain 
parameter classes. 

Note that A —> oo leads to a linear discriminant rule with weight vector 
5 = /i + i — /i_-|. Usual independence rules take the form diag(E c ) -1 <5. We will 
assume that all features are normalized to have equal variance, Xfo = a > 0. 
In this case the A —> oo rule corresponds to an independence rule with oracle 
information about the equality of variances; which we still call “indpendence 
rule” for simplicity. 

Extending our previous notation, we define the asymptotic margin of LDA 
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and independence rules, by taking the limits of 0(A) at 0 and oo: 


0lda = lim 


A - >0 ^a 2 rj + 6 


and 0 ir = lim 


^/ct^rf+d 


A—>-oo 


Both limits are well-defined and admit simple expressions, as given below. 

Theorem 3.5. Let H be the limit population spectral distribution of the covari¬ 
ance matrices E; and let T be a random variable with distribution H. Under 
the conditions of Theorem 3.1, the margins of LDA and independence rules are 
equal to 



^a 2 E H [T]+^E H [T 2 ] 


The formula for LDA is valid for 7 < 1 while that for IR is valid for any 7 . 

This result is proved in the supplement. The formulas are simpler than 
Theorem 3.1, as they involve the population spectral distribution H directly 
through its moments. For RDA, the error rate depends on H implicitly through 
the Stieltjes transform of the ESD F. 

These formulas are equivalent to known results, some of which were obtained 
under slightly different parametrization. In particular, Serdobolskii (2007) at¬ 
tributes the IR formula with H = Si to unpublished work by Kolmogorov in 
1967, while the Raudys and Young (2004) attributes it to Raudys (1967). The 
LDA formula was derived by Deev (1970) and Raudys (1972); see Section 3.5. 
Here, our goal was to show how these simple formulas can be recovered from 
the more powerful Theorem 3.1. 

Saranadasa (1993) also obtains closed form expressions for the limit risk of 
two classification methods, the D-criterion and the A-criterion. One can verify 
that these are asymptotically equivalent to LDA and IR, respectively. Our 
results are consistent with those of Saranadasa (1993); but they differ slightly 
in the modeling assumptions. In our notation, his results (as stated in his 
Theorem 3.2 and Corollary 3.1) are: ©lda = oty/E h [T -1 ]\/l — 7 and Qf R = 
a/\JE h [T]. These results are nearly identical to Theorem 3.5, but our equations 
have an extra term involving 7 in the denominator: 7 for LDA and 7 E h[T 2 ] 
for IR. The reason is that we consider p.±i as random, whereas Saranadasa 
(1993) considers them as fixed sequences of vectors; this extra randomness yields 
additional variance terms. 

Theorem 3.5 enables us to compare the worst-case performance of LDA and 
IR over suitable parameter classes of limit spectra. For 0 < ti < 1 < ^ we 
define the class 


n{ki,k 2 ) = {H : E H {[ki,k 2 \) = 1, EffPI = 1} • 


The bounds k\ , &2 control the ill-conditioning of the population covariance ma¬ 
trix. We normalize such that the average population eigenvalue is 1, to ensure 
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that the scaling of the problem does not affect the answer. This parameter 
space is somewhat similar to the one considered by Bickel and Levina (2004). 
Perhaps surprisingly, however, a direct comparison over these natural problem 
classes appears to be missing from the literature, and so we provide it below 
(see the supplement for a proof). 


Corollary 3.6. Under the conditions of Theorem 3.5, consider the behavior of 
LDA and independence rules for H £ TL{k \, /e 2 ). 


1. The worst-case margin of LDA is: 

€>lda{Ti a2 ) := inf QLDA{H,r,a 2 ) = — j] 7 . 

HeH(k 1 ,k 2 ) yja 2 + 7 

The least favorable distribution for LDA from the class TL(k i,fc 2 ) is the 
point mass at 1: H = S\, i.e., X = I. 


2. The worst-case margin for independence rules is: 
@m(H,r,a 2 ) := inf &m{H, "r,a 2 ) = 


HeH(k!,k2) 


a/o 2 + 7(fci + fc 2 - kik 2 ) 


If hi < At 2 the least favorable distribution is the mixture H = WiSk 1 +W 2 Sk 2 , 
where the weights are w i = (&2 — 1)/(^2 — ^l) and W 2 = (1 — Aq)/(A ; 2 — k\); 
while if k\ = k 2 = 1, it is the point mass at 1: H = <5i. 


This result shows a stark contrast between the worst-case behavior of LDA 
and independence rules: for fixed signal strength, the worst-case risk of LDA 
over Tl only depends on 7 , and is attained with the limit of identity covariances 
E = Ipxp regardless of the values of k\, ^ 2 - In contrast, the worst-case behav¬ 
ior of IR occurs for a least favorable distribution H that is as highly spread 
as possible. This highlights the sensitivity of IR to ill-conditioned covariance 
matrices. For 0 < 7 < 1, we see that IR are better than LDA in the worst case 
over H, i.e. Blda( 7 ;o: 2 ) < Qir(%, 7 ; a 2 ), if and only if 

a 2 + 1 > (1 — 7)(fci + fc 2 - fcife). 

In particular IR performs better than LDA for strong signals a; with weaker 
signals, LDA can sometimes have an edge, particularly if the covariance is poorly 
conditioned, quantified by a large measure of spread fci + /c 2 — fci/e 2 = (fc 2 — l)(l — 
ki) + 1 . 


3.5 Literature Review for High-Dimensional RDA 

There has been substantial work in the former Soviet Union on high-dimensional 
classification; references on this work include Raudys and Young (2004), Raudys 
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(2012), and Serdobolskii (2007). Raudys (1967) 7 derived the n, p —» oo asymp¬ 
totic error rate of independence rules in identity-covariance case E = Ipxpi while 
Deev (1970) and Raudys (1972) obtained the error rate of un-regularized linear 
discriminant analysis (LDA) for general covariance E, again in the n, p —> oo 
regime . 8 As shown in the previous section, these results can be obtained as 
special cases of our more general formulas. 

For RDA, Serdobolskii (1983) (see also Chapter 5 of Serdobolskii (2007)) 
considered a more general setting than this paper: classification with a weight 
vector of the form r(E c ) _1 j instead of just (E c + \I pxp 7 1 <5, where the scalar 
function F admits the integral representation T(x) = /(x + t)~ l dri(t) for a 
suitable measure 77 , and is extended to matrices in the usual way. He derived 
a limiting formula for the error rate of this classifier under high-dimensional 
asymptotics. However, due to their generality, his results are substantially more 
involved and much less explicit than ours. In some cases it is unclear to us how 
one could numerically compute his formulas in practice. Furthermore, his results 
are proved when 7 < 1 , and show convergence in probability, not almost surely. 
We also note the work of Raudys and Skurichina (1995), who derived results 
about the risk of usual RDA with vanishingly small regularization A = o(l), and 
for the special case 7 < 1 . 

In another line of work, a Japanese school (e.g., Fujikoshi et al., 2011, and 
references therein) has studied the error rates of LDA and RDA under high- 
dimensional asymptotics, with a focus on obtaining accurate higher-order ex¬ 
pansions to their risk. For instance Fujikoshi and Seo (1998) obtained asymp¬ 
totic expansions for the error rate of un-regularized LDA, which can be verified 
to be equivalent to our results in the A —>• 0 limit. More recently, Kubokawa 
et al. (2013) obtained a second-order expansion of the error rate of RDA with 
vanishingly small regularization parameter A = 0(l/n) in the case 7 < 1. 

Finally, in the signal processing and pattern recognition literature, Zollan- 
vari et al. ( 2011 ) provided asymptotic moments of estimators of the error rate 
of LDA, under an asymptotic framework where n,p —> 00 ; however, this pa¬ 
per assumes that the covariance matrix E is known. More recently, Zollanvari 
and Dougherty (2015) provide consistent estimators for the error rate of RDA 
in a doubly asymptotic framework, using deterministic equivalents for random 
matrices. The goal of our work is rather different from theirs, in that we do 
not seek empirical estimators of the error rate of RDA, but instead seek simple 
formulas that help us understand the behavior of RDA. 
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4 Supplement 

The supplement is organized as follows: Section 5 describes the efficient com¬ 
putation of the risk formulas, Section 6 has the proofs for ridge regression, 
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and Section 7 has the proofs for regularized discriminant analysis. At several 
locations we refer to equation numbers from the main text. 


5 Efficient computation of the risk formulas 

Consider the spectral distribution of the companion matrix Y L = n~ 1 XX T . 
Since its spectral distribution Fg differs from Fg by |n — p\ zeros, it follows 
that £ has a limit ESD F, given by F = 7 F + (1 — 7 )/[o,oo)- The companion 
Stieltjes transform satisfies the Silverstein equation (Silverstein and Combettes, 
1992; Silverstein and Choi, 1995): 


1 f tdH(t) 

v(z) ~ y 1 + tv(z) 


( 12 ) 


It is known that for z £ S := {u + iv : v 7 ^ 0, or v = 0, u > 0}, v(z) is the unique 
solution of the Silverstein equation with 11 ( 2 :) £ S such that sign(Imag{u( 2 :)}) = 
sign(Imag(z)). 

We now explain how to compute the key quantities m,v,m' ,v' that will 
come up in our risk formulas. On the interval v £ [0,oo), Silverstein and Choi 
(1995) prove that the functional inverse of z —► v := v(z) has the explicit form: 


z(v) 


1 ftdHit) 

v + y TT^7' 


(13) 


This result enables the efficient computation of the function 2 : —> v(z) for z < 0. 
Indeed, assuming one can compute the corresponding integral against H , one 
can tabulate (13) on a dense grid of 17 > 0, to find pairs ( Vi , Zi ), where Zi = z(vi). 
Then for the values < 0, the Silverstein and Choi (1995) result shows that 
v(zi) = Vi. Further, the Silverstein equation can be differentiated with respect 
to z to obtain an explicit formula for v' in terms of v, H: 

f t 2 dH(t) \ -1 
dz \u 2 ^ J (1 + tv) 2 J 

Therefore, once v(z) is computed for a value z, the computation of v'(z) can 
be done conveniently in terms of v{z) and H , assuming again that the integral 
involving H can be computed. This is one of the main steps in the Spectrode 
method for computing the limit ESD (Dobriban, 2015). Finally, m{z) can be 
computed from v(z) via the equation (4), and m'(z) can be computed from v'(z) 
by differentiating (4): 7 (ml(z) — I/ 2 2 ) = v'(z) — 1/z 2 . 
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6 Proofs for Ridge Regression 


6.1 Proof of Theorem 2.1 

The risk r\(X) equals 

r\(X) = E \(x ■ w\ — x ■ w — £o) 2 | X] 

= 1 + E [{a: • (w\ — w)} 2 | X] = 1 + E (w x — w) T E (w x — w) | X 


where So is the noise in the new observation. Now w x = (X T X + Xn Ipxp)^ 1 
X T Y , and Y = Xw + s, where e is the vector of noise terms in the original 
data. Hence 


w\ — w = (X T X + \nl p xp) 1 X T (Xw + e) — w 

= -X n{X T X + A nl pxp )- 1 w + (X T X + A nI pxp )~ 1 X T e. 

When we plug this back into the risk formula r x (X), and use that w,e are 
conditionally independent given X, we see that the cross-term involving w,e 
cancels. The risk simplifies to 

r x (X) = 1 + (An) 2 E [w T (A' T X + An/ pxp ) _ 1 E(A T A + Xnl pxp )~ 1 w \ X] 

+ E [e T X(X r X + XnI pXp )- 1 Y,(X T X + XnI pXp )- 1 X T £\X] . 

Now using that the components of w and e are each uncorrelated conditional 
on A, we obtain the further simplification 

2 

r\(X) = 1 -I- (A n) 2 tr ^E (A t X + A nI pXp ) 

+ tr (E (A t A + XnlpxpY 1 X T X (A t A + An/ W )' 1 ) . 

Introducing E = n~ 1 X T X and 7p = p/n, and splitting the last term in two 
by using X T X = X T A + A nl pxp — Xnl pxp this yields 

r A (A) = 1 + ^ tr (E (E + A / pxp ) + (Aa 2 - 7p )^ tr (e (£ + A/ pxp ) . 

(14) 

For the particular choice A* = 7 p a~ 2 , we obtain the claimed formula r A * (A). 
Next, we show the convergence of r x ( A) for arbitrary fixed A. First, by as¬ 
sumption we have 7p —> 7 . Therefore, it is enough to show the almost sure 
convergence of the two functionals: 

— tr ^E ^E A/ pxp ) ^ and — tr ^E ^E T A/ pxp 

The convergence of the first one follows directly from the theorem of Ledoit 
and Peche (2011), given in (5). The second is shown later in the proof of Lemma 
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7.4 in Section 7.1.4. In that section it is assumed that the eigenvalues of £ are 
bounded away from 0 an infinity; but one can check that in the proof of Lemma 
7.4 only the upper bound is used, and that holds in our case. Therefore the risk 
r\(X) converges almost surely for each A. 

Next we find the fomulas for the limits of the two functionals. The 

limit of p -1 tr ^£ ^£ + A/ pX p) ^ equals A) = 7 _ 1 (l/[Au(—A)] — 1) by (5). 

In the proof of Lemma 7.4 in Section 7.1.4, it is shown that the limit of 

p- 1 tr ^£ (2 + A/pxp) ^ is —«/(A) = [v(-A) - At/(-A)]/[ 7 (Aw(-A)) 2 ]. 

Simplified expression for R\: Putting together the results above, we 
obtain (with v = v(—X),v' = v'(—X)) the desired claim: 


Rx 


1 + 7 «(—A) + (Aa 2 - 7 )A[—k'(—A)] 


i + < A ” 2 


7 ) A 


v — Xv' 
j(Xv) 2 





Second part: Convergence: First, we note that for A* = 7 p a 2 , the 
finite sample risk equals by (14) 

Introduce the function k p (X,X) = ^ tr ^£ ^£ + A/ pX p) We need to 

show k p (X*,X) k(A*). First, we notice that A* —> A*, and k p (X,X) —► k(X) 

almost surely. Second, we verify the equicontinuity of k p as a function of A, by 
proving the stronger claim that the derivatives of k p are uniformly bounded: 


\K(x,x)\ 


1 , T \- 2 \ 


tr £ 

— tr ( £ ( £ + Xl pxp J ) 

< 


p \ \ ) ) 


V 


< K. 


The last inequality holds for some finite constant K because E p s X = p 1 tr £ —> 
E hX < 00 by the convergence and boundedness of the spectral distribution of 
£. 

Therefore, by the equicontinuity of the family k p (X,X) as a function of A, 
we obtain k p (X*,X) —> k(X*). Further, the explicit form of shows that the 
limit equals 1 + 7 k(A*) = l/(X*v(— A*)) by (5), as desired. 

Optimality of A* : The limiting risk R\ is the same if we assume Gaussian 
observations. In this case, the finite sample Bayes-optimal choice for X p is 
A* = 7 p a~ 2 . We will use the following classical lemma to conclude that the 
limit of the minimizers A* is the minimizer of the limit. 

Lemma 6.1. Let f n {x) : X —> 1 be an equicontinuous family of functions on 
an interval I, converging pointwise to a continuous function, f n {x) —> f(x). 
Suppose y n is a minimizer of f n on X, and y n —> y. Then y is a minimizer of 

/■ 
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Proof. Since y n is a minimizer of /„, we have 

fn(y n ) < fn(x), (15) 

for all x £ I. But f n (y n ) = f n (y n )-f n (y) + f n (y)-f(y) + f(y). As n ->• oo, 
fn{yn) — fn{y ) —t 0 by the convergence of y n —> y and by the equicontinuity of 
the family {/„}; and f n (x) — f(x) —> 0 for all x by the convergence of /„ —> f. 
Therefore, taking the limit as n —>• oo in (15), we obtain f(y) < fix) for any 
x £ X, showing that y is a minimizer of /. □ 

We use the notation r\ tP {X) = r\(X) for the risk, showing that it depends 
on p. Fix an arbitrary sequence of X matrices on the event having probability 
one where r\ jP (X) —> R\. By an argument similar to the one given above, the 
sequence of functions f p ( A) = r\ p (X) equicontinuous in A on the set A > 0. 
Since f p ( A) converges to R\ for each A > 0, and A* is a sequence of minimizers 
of f p ( A) that converges to A*, by Lemma 6.1 A* is a minimizer of R\. 

6.2 The Risk of Ridge Regression with Identity Covari¬ 
ance 

From Theorem 2.1 it follows that the risk equals the limit r\(X) (14). Since 
E = I this simplifies to 

rx(X) = 1 + ^ tr ((E + A/pxp)” 1 ^ + (A a 2 - 7p )^ tr ((s + A/ pxp )”^ . 

By the Marchenko-Pastur theorem, Eq. (3), it follows that 

p -1 tr ^E + XIpxp') ^ —> mi{— A; 7 ) is defined in (7). 

In the proof of Lemma 7.4 in Section 7.1.4, it is shown that 

p tr ^E + A Ipxp'j ^ —t —k'(X). For an identity covariance matrix k(A) = 

m/(—A; 7 ) by definition of /c(A). Therefore, the limit of the second term 
equals to p (—A; 7 ). We obtain the desired formula: R\ = 1 + 7 W/(—A; 7 ) + 
A (Act 2 — 7 ) rn p (—A; 7 ). For A* = 7 a~ 2 , we obtain R* = 1 + 7 TO/(—A*; 7 ). It is 
a matter of simple algebra to verify the formula ( 8 ) for the risk. 

6.3 Proof of strong-signal limit of ridge 

We will first show the results for the strong-signal limit. We start by verifying 
the following lemma. 

Lemma 6.2. Suppose the limit population eigenvalue distribution H has support 
contained in a compact set bounded away from 0. Let v(z) be the companion 
Stieltjes transform of the ESD. Then 

A Ifl < 1, lim A .j.o Au(—A) = 1 - 7 . 

2. 7/7 > l, lim^o v(—X) = u(0). 
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3. If i = 1, linu.j .0 Xv(—X) 2 = E^T" 1 ]. 

Proof. Let F_ be the ESD of the companion matrix n~ 1 XX T . It is related to F 
via F_ = (1 — 7 )do+ 7 -F. It is well known (e.g., Bai and Silverstein, 2010, Chapter 
6 ), that for H whose support is contained in a compact set bounded away from 
0, the following hold for F and F: if 7 < 1, then F has support contained in a 
compact set bounded away from 0, and F has a point mass of 1 — 7 at 0; while 
if 7 > 1, then F_ has support contained in a compact set bounded away from 0, 
and F has a point mass of 1 — 7" 1 at 0. 

If 7 < 1, we let Y be distributed according to F. Since Y > c > 0 for some 
c, we have by the dominated convergence theorem 


lim Xm(—X) 

A4.0 


lim E p 

A40 


A 

XT Y 


= 0. 


Since Xv(—X) = 1 — 7 + 7 Ato(—A), this shows lim^o Au(—A) = 1 — 7. 

Similarly, if 7 > 1, we let Y be distributed according to F. Since Y > c > 0 


for some c, we have lim^o Av(—A) = lim^oEi? 


A+v 


= Ef 


= v(0). This 


shows lim^o v(— A) = u(0). Further, we can find the equation for u(0) by 
taking the limit as z —> 0 in the Silverstein equation (12). We see that v is 
well-defined, bounded away from 0, and has positive imginary part for 2 € C + 
in a neighborhood of 0 , so the limit is justified by the dominated convergence 
theorem. This leads to the equation that was claimed: 


1 _ f t dH(t) 

u( 0 ) ^ J l+fu( 0 )’ 


Finally, for 7 = 1 , the Silverstein equation (12) is equivalent to 


zv(z) 


dH(t) 

1 + tv(z) 


Therefore the real quantity lim^o v(— A) = lim^oEi? 


A +Y 


is not finite; oth¬ 


erwise, we could take the limit as z —> 0, z £ C + , similarly to above, and obtain 
the contradiction that 0 = f (1 + tu(0 )) _1 dH{t) ^ 0. Since v(—X) is increas¬ 
ing as A l 0, it follows that lim^oT—A) = + 00 . Multiplying the Silverstein 
equation above with v(z) 


zv{z) 2 


dH(t) 

1 /v{z) +t' 


We can take the limit in this equation as A = —z f 0 similarly to above, and 
note that the right hand side converges to — / t~ l dH(t) = — E^p 1-1 ] by the 
dominated convergence theorem. This shows lim^o Xv(— A ) 2 = E^p 1-1 ]. □ 


Consequently, using the formula for the optimal risk from Theorem 2.1, we 
have for 7 < 1 , lim Q 2_ >00 R*{H, a 2 , 7) = (1 — 7) h For 7 > 1, 


lim a 2 R* ( H , a 2 , 7 ) 

a 2 —>00 


lim 

a 2 —>00 


PrT-TX 

OL z V Oi z / 


1 

7^(0) 
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Finally, for 7 = 1, 


lim a 1 R*(H,a 2 , 7 ) 

oc 2 —>00 


lim 

a 2 —>00 



lim 

a 2 —>00 


(sr«(- 


iO 2 ) 


1/2 


1 


The explicit formula for R*[a 2 , 1) is obtained by plugging in the expression (7) 
into the formula for the optimal risk. 

Next we argue about the weak-signal limit. Using the above notations, 
Xv(— A) = Ei?[A/(A + y)], so by the dominated convergence theorem, lim^oo 
Au(—A) = 1, and consequently lim Q 2 _ ) ,Q R*(H, a 2 , 7 ) = 1. Furthermore, A[1 — 
Au(—A)] = 7 Ef[UA/(A + Y)\, leading to lim^oo A[1 — \v{— A)] = 7 Ei?[F] = 
7 E//[T], and consequently lim Q 2 _ > 0 (i?*( 77 , a 2 , 7 ) — l)/a 2 = E#T. 


7 Proofs for Regularized Discriminant Analysis 

7.1 Proof of Theorem 3.1 


We will first outline the high-level steps to prove our main result for classifica¬ 
tion, Theorem 3.1. We break down the proof into several lemmas, whose proof 
is deferred to later sections. These lemmas are then put together to prove the 
theorem in the final part of the proof outline. 

We start with the well-known finite-sample formula for the expected test 
error of an arbitrary linear classifier h w b{x) = sign(w ■ x + b) in the Gaussian 
model ( 2 ), conditional on the weight parameters w,b and the means /j± 1 : 


Err 0 (w, b) = 7 r_$ 


w T fj,- 1 + b 
\! w T T,w 


+ 7T + $ - 


w T n 1 + b 

VuYY/w 


(16) 


In RDA the weight vector is w\ = ^E c + A I pxp ^ <5 and the offset is b = 

<5 T ^E c + A/pxpj A- The first simplification we notice is that b —} a .a. 0 . 


Lemma 7.1. Under the conditions of Theorem 3.1, we have b 


0. 


Lemma 7.1 is proved in Section 7.1.1. Since the denominator in the error rate 
(16) converges almost surely to a fixed, strictly positive constant (see Lemmas 
7.4 and 7.5), this will allow us to use the following simpler formula - that does 
not involve b - in evaluating the limit of the error rate. 


Enq (w) = 7 r_$ 


w T H-i 


7T + $ — 


w T h 1 
V w T Y,w 


(17) 


. vVTEw;, 

Recall that /r_i = p, — 5, /r+i = p + 6. The second simplification we notice 
is that wjp —> a . s . 0 . 


Lemma 7.2. Under the conditions of Theorem 3.1, we have w^ /.1 


0. 
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Lemma 7.2 is proved in Section 7.1.2. By the same argument as above, 
this Lemma allows us to use the following even simpler formula - that does not 
involve ft - in evaluating the limit of the error rate: 

Err 2 (w) = $ ( — 7 ==) • (18) 

\ V w 1 Law J 

To show the convergence of $ f — wj S / yjwj T,ui\ J, we argue that the linear 
and quadratic forms involving wf concentrate around their means, and then 
apply random matrix results to find the limits of those means. We start with 
the numerator. 

Lemma 7.3. We have the limit wl^S —> a .s. ct 2 m(— A), where m(z) is the Stielt- 
jes transform of the limit empirical eigenvalue distribution F of the covariance 
matrix E c . 

Lemma 7.3 is proved in Section 7.1.3. To prove the convergence of the 
denominator, we decompose it as: 


wj^ Sw)^ — (5 t (e c + XIpxp'j E ^E c + A/pxp) 5 — A + 2 B + C (19) 

where M := ^E c + XI pX p^j E ^E c + A I pX p^ and 

A:=5 t MS , B := 5 T M (S - s'j , C := (S - 6^j T M (S - 6^j . 

One can show B —> a s 0 similarly to the analysis of the error terms in the 
proof of Lemmas 7.1 and 7.3; we omit the details. The two remaining terms 
will converge to nonzero quantities. First we show that: 

Lemma 7.4. We have the convergence A := 6 T MS —t a .s. a 2 |«/(A)|, where 

with v the companion Stieltjes transform of the ESD of the covariance matrix, 
defined in (4). Expressing the derivative explicitly, we have the limit A — > a .s. 
(v — Az/)/[ 7 (Au) 2 ]. The limit is strictly positive. 

Lemma 7.4 is proved in Section 7.1.4, using Ledoit and Peche (2011)’s result 
and a derivative trick similar to that employed in a similar context by El Karoui 
and Kosters (2011); Rubio et al. (2012); Zhang et al. (2013). Finally, the last 
statement that we need is: 


Lemma 7.5. We have the limit 

C:= (S-j) T M "AT, 

where v the companion Stieltjes transform of the ESD of the covariance 
matrix, defined in (4). 


35 







Lemma 7.5 is proved in Section 7.1.5, as an application of the results of 
Hachem et al. (2008) and Chen et al. (2011). With all these results, we can now 
prove Theorem 3.1. 

Final proof of Theorem 3.1. By the decomposition (19) and Lemmas 7.4 and 
7.5, we have the convergence 


wj Y,w\ 


v — Xv' 
'y(Xv) 2 


X 2 v 4 


By Lemma 7.5, the second term is strictly positive. Therefore, combining 
with Lemma 7.3 and the continuous mapping theorem, we have 


™J S 



a 2 m(—A) 


a* 


v—Xv' 

7(Av) 2 


v' — v 2 

X 2 v 4 



( 20 ) 


Denote by 0 the parameter on the right hand side. After algebraic simpli¬ 
fication, we obtain that 0 has exactly the form stated in the theorem for the 
margin of RDA. To finish the proof, we show that the error rate is indeed de¬ 
termined by 0. From (20) and the continuous mapping theorem, recalling the 
error rate Err 2 (w) from (18), we have Err 2 (w\) —t a .s. 4>(—0). 

From Lemma 7.2 and the definition of the error rate Erri ( w ) from (17), we 
can move from Err 2 to Erri: Err 2 (wa) ~~ Erri (w\) —> a s 0. 

Finally, from Lemma 7.1 and the definition of the error rate Erro (w) in 
Equation (16), we can discard the offset b , and move from Erri to Erro: 
Erri (w\) - Err 0 (w\, b^j -> a . Sm 0. 

The last three statements imply that Err 0 —> a . s . $(—0), which fin¬ 
ishes the proof of Theorem 3.1. □ 

In the proofs of the lemmas we will use the following well-known statement 
repeatedly: 

Lemma 7.6 (Concentration of quadratic forms, consequence of Lemma B.26 
in Bai and Silverstein (2010)). Let x £ R p be a random vector with i.i.d. 
entries and E [x\ = 0, for which E [(^pXi ) 2 ] = er 2 and sup, E [(y/pXi) i+71 ] 
< C for some 77 > 0 and C < 00 . Moreover, let A p be a sequence of 
random p x p symmetric matrices independent of x, with uniformly bounded 
eigenvalues. Then the quadratic forms x T A p x concentrate around their means: 
x T A p x — p~ 1 cr 2 tr A p —> a .s. 0. 

Lemma 7.6 requires a small proof, which is provided in Section 7.1.6. The 
rest of this section contains the proofs of the lemmas. 


7.1.1 Proof of Lemma 7.1 

We start by conditioning on the random variables p,,S, or equivalently on p±i- 
Conditional on jl, 5, we have that jl + 1 ~ N(p+\, 2E/n), independently of ft -1 ~ 
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Af(p-i,2T,/n). Therefore, 5 ~ Af (5, E/ra), and independently p ~ 7V(//, S/n). 
Further - still conditionally on p, S - it holds that E c and /i±i are independent 
by Gaussianity. This shows that conditionally on p, <5, the random variables 
E c ,p,5 are independent. 

Hence there exist two standard normal random vectors Z, W £ R p indepen¬ 
dent of S c conditionally on p, S, such that we can represent 

<5 = 6+ -^E 1 / 2 ^, p = p-\ — \=Y}' 2 W. (21) 

y/n yjn 

Crucially, this representation has the same form regardless of the value of 
S, p, therefore the random variables Z, W are unconditionally independent of 
5, p, E c . The unconditional indepdendence of S, p, E c , Z, W will lead to conve¬ 
nient simplifications. 

We decompose b according to (21) into the four terms that arise from ex¬ 
panding S , p: 

b = 5 t (E c + A Ipxp^j p = T 1 +T 2 +T 3 + T 4 , 

where L = ^E c + A/ pX p) and 


T\ 

:= 8 T Lp 

( 22 ) 

t 2 

:= n~ 1/2 Z T E 1/2 Lp 

(23) 

t 3 

:= n~ 1/2 8 T LY} /2 W 

(24) 

t 4 

:= n~ 1 Z T Y, 1/2 LY} /2 W. 

(25) 


The proof proceeds by showing that each of the T) converge to zero. 

The first term: T). Let us denote l = Lp. Then by the independence and 
the zero-mean property of the coordinates of 8 

E [{6 r l)*\i\ = l t E [St] + E l * l i E K 2 ] E $] ■ 

i= 1 1<Z7^J<P 

Note that we have E [(y/p<5j) 4 l < C' for some constant C' by the increasing 
relation between L p norms and by E [(v / P<5i) 4+T/ ] < C. Therefore, with C 
denoting some constant whose meaning may change from line to line 


e [(8 T iy\i] 


1 2=1 


or 


E 

1 


i 2 i 2 < 
bS - 


Si 


4 

2 - 


By the boundedness of the operator norm of L, we see ||Z || 2 < ||/x|| 2 /A. By 
assumption, ||/2 || 2 < Cp l ^~ v ^ 2 almost surely for sufficiently large p. Therefore 
E [(5 T Z) 4 ] < Cp _(1+2?)) almost surely for sufficiently large p; and this bound is 
summable in p. Using the Markov inequality for the fourth moment, P(|<5 T Z| > 
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c) < E [(<5 T Z) 4 ] /c 4 , and by the Borel-Cantclli lemma, it follows that S T l —► 0 
almost surely. 

The second term: This term differs from T\ because n~ x ! 2 Z replaces 

S. To show the convergence T) —> a , s , 0 we only used the properties of the first 
four moments of S. The moments of n^ x ^ 2 Z scale in the same way with p as 
the moments of S. Therefore, the same proof shows X 2 —¥ a , s , 0. 

The last two terms: X 3 and X 4 . The convergence of these terms follows 
directly from a well-known lemma, which we cite from Couillet and Debbah 
( 2011 ): 

Lemma 7.1 (Proposition 4.1 in Couillet and Debbah (2011)). Let x n £ R n and 
y n £ K™ be independent sequences of random vectors, such that for each n the 
coordinates of x n and y n are independent random variables. Moreover, suppose 
that the coordinates of x n are identically distributed with mean 0, variance Cjn 
for some C > 0 and fourth moment of order 1 /n 2 . Suppose the same conditions 
hold for y n , where the distribution of the coordinates ofy n can be different from 
those of x n . Let A n be a sequence of n x n random matrices such that ||j4„|| is 
uniformly bounded. Then xfA n y n —> a . a . 0. 

While this lemma was originally stated for complex vectors, it holds verbatim 
for real vectors as well. The lemma applied with x n = <5, y n = n~ l ^ 2 W and 
A n = LE 1 / 2 shows convergence of T 3 —t a . a . 0; and similarly it shows T 4 0. 
This finishes the proof of Lemma 7.1. 


7.1.2 Proof of Lemma 7.2 

This follows from the proof of Lemma 7.2 by noting wjp = S T Lp = T\ + T 2 , 
where L = (£ c + A I pxp ) , and with T) from Equation (25). 


7.1.3 Proof of Lemma 7.3 


Proof. We decompose 

wj6 = S r (Y c + XI pXp y 1 S = 6 r (Y c + XI pXp y 1 S+(5-S) T (s c + A/ pxp ) S. 

(26) 


We will analyze the two terms separately, and show that the second term con¬ 
verges to 0 . 

The first term in (26): 6 T ^E c + A Ipxp'j S. Let us denote by mg ( z ) = 

p - 1 tr{(E c — zlpxp)- 1 } the Stieltjes transform of the empirical spectral distri¬ 
bution (ESD) of E c . We compute expectations using the assumption ES,6j — 
Sija 2 /p (where Sij is the Kronecker symbol that equal 1 if * = j and 0 otherwise), 
and using that 6 is independent of E c : 


E 


<r(E c + A/ pxp ) '<5 


—E 


tr ^E c -(- A/ P xp) 


= a 2 E 


mg (-A) 
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We now argue that E 



—► m(—A), where m(— A) is the Stielt- 


jes transform of the limit ESD F of sample covariance matrices E = 
n~ 1 Yj 1 / 2 E T EYi 1 / 2 , where E has i.i.d. entries of mean 0 and variance 1, and 
the spectrum of E converges to H. Indeed, this follows from the Marchenko- 
Pastur theorem, given in Equation (3): E [mg(—A)] —» m(— A); in addition we 
need to argue that centering the covariance matrix does not change the limit. 
Since our centered covariance matrix differs from the usual centered sample 
covariance matrix, for completeness we state this result as a lemma below: 


Lemma 7.7. Let E c be the centered and rescaled covariance matrix used in 
RDA. Then 


1. Under the conditions of Theorem 3.1, the limit ESD of E c equals, with 
probability 1, the limit ESD F of sample covariance matrices E = 
n _1 E 1 / 2 V’ T V’E 1 / 2 J where V is n x p with i.i.d. entries of mean 0 and 
variance 1. Therefore centering the covariance does not change the limit. 

2. The Stieltjes transform of E c , mg (z) converges almost surely and in ex¬ 
pectation to m(z), the Stieltjes transform of F. For each z £ C \ R + , 
?ng (z) — > a .s m{z); and Emg (z) —> m(z). 


Proof of Lemma 7. 7. Let Ui be the centered data points, Ui = Xi — /i+i for the 
positive training examples, and u, = Xi~p—\ for the negative training examples. 
The Ui have mean 0 and covariance matrix E. Let further v + \ = fi + \ — n + \ be 
the centered mean of the positive training examples; and define U-\ = p-i—p-i 
analogously. We observe that 


E,= 


1 

n — 2 

1 

n — 2 

1 

n — 2 
n 

n — 2 


Y, ( x i- 

£+1 )(Xi 

_ £+i) t 

+ Y (Xi-p,-l)(x 

i 

'“h 

i-Vi = 1 



i-Vi—- 1 



V+l ){Ui 

- £+i) T 

+ Y ( u *-^-l)( M 





i:yi=- 1 

- 

Y UiU i 

+ E 

UiuJ - 

2^+W+i +V-1V-D 



1 


i--Vi =-1 


y}/ 2 v t pv e 1 / 2 


n 


where V is the n x p matrix with each row equal to E -1 / 2 zq, which are i.i.d. 
Gaussian vectors with i.i.d. entries of mean 0 and variance 1. Also, P is the 
projection matrix P = J pxp —2?r _1 (e + ieJ 1 +e_ie]i 1 ), where the vectors e±i are 
the indicator vectors of the training examples with labels ±1. Recalling that 
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E = n -iYi/2 V T vY}/ 2 is the uncentered, 1 /n-normalized covariance matrix, 
the difference between X and X c is 


X - X c = -X 1 / 2 V T UX 1/2 -— -X 1 / 2 V T PVT } /2 

n n — 2 |_n 

= . 2 n . s 1 / 2 v T vs 1/2 + —!— E 1 / 2 y T (7 pX p - p)ys 1 / 2 = ri + r 2 . 

n(n — 2 ) n — 2 

It is easy to check that the two error terms Id and r 2 are small. Specifically, 
we can verify that the Frobenius norm 11T 1 11p r —> 0. Therefore, By Corollary 
A.41 in Bai and Silverstein (2010) it follows that Tj can be ignored when com¬ 
puting the limit ESD. Further, I pxp — P is of rank at most two by the definition 
of P. Therefore, by Theorem A.44 in Bai and Silverstein (2010), T 2 does not af¬ 
fect the limit ESD. Putting these together, it follows that X c has the same ESD 
as X; the latter exists due to the Marchenko-Pastur theorem given in Equation 
(3). Therefore, the ESD of X c converges, with probability 1, to F. This finishes 
the first claim in the Lemma. 

Claim 2 then follows immediately by the properties of weak convergence of 
probability measures, because the Stieltjes transform is a bounded continuous 
functional of a probability distribution. 

□ 


Therefore, going back to the proof of Lemma 7.3, we obtain using Lemma 7.7, 

-l 


that E 




(—A) —► m(—A). Now note that each eigenvalue of ^X c + A/ pxp ^ 


is uniformly bounded in [0,1/A], <5 is independent of X c , and the 4 + 77 -th mo¬ 
ment of y 7 p5i is uniformly bounded, so by the concentration of quadratic forms, 
Lemma 7.6: 


iT ( 


X, + \Ipx P ^j S — E 


<5 t (j 


XL 


pxp 


Putting everything together, we have shown that the first term converges 
almost surely to a 2 m{—X). 

The second term in (26) converges to zero: Using the decom¬ 

position (21) from the proof of Lemma 7.1 in Section 7.1.1, and the nota¬ 
tion L = ^X c + A I pxp ^j we can write e 2 := <5 T ^X c + XI pxp ''j ^<5 — <5^ = 

-X=^LY}' 2 Z. 

y/n 

This has the same distribution as X 3 = n _ 1 / 2 (5 T LX 1 // 2 IU, because Z , W 
are identically distributed, independently of <5, E c . In Lemma 7.1, we showed 
T;i —> a .s. 0 , so e 2 —> a . s . 0 - 0 


7.1.4 Proof of Lemma 7.4 

In this proof it will be helpful to use some properties of smooth complex func¬ 
tions. Let I? be a domain, i.e. a connected open set of C. A function / : D —> C 
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is called analytic on D if it is differentiable as a function of the complex vari¬ 
able z on D. The following key theorem, sometimes known as Vitali’s theorem, 
ensures that the derivatives of converging analytic functions also converge. 

Lemma 7.8 (see Lemma 2.14 in Bai and Silverstein (2010)). Let fi, / 2 , ■ • ■ be 
analytic on the domain D, satisfying \f n (z)\ < M for every n and z in D. 
Suppose that there is an analytic function f on D such that f n (z) —> f(z) for 
all z £ D. Then it also holds that f' n {z) —> f'(z) for all z £ D. 


We can now prove Lemma 7.4. 


Proof of Lemma 7-4 ■ We need to prove the convergence of A = d T MS, 
where M = ^E c + A/ pxp ^ £ ^E c + A/ pxp ^ • Using the assumption 


E<5,6j = 


bijOt 2 /p , 


—E 

p 


tr E 


pxp 

and that 
, -2 

V A Ipxp 


S, E c are indpendent, we get E 


A 


(pc T A/ pxp ^ 


Now, the eigenvalues of E are bounded above by B, and those of 
(T, c + XIpxp^ are bounded above by 1/A; this shows the eigenvalues of L 
are uniformly bounded above. Further, <5, L are independent, and <5 has 4 + ?y-th 
moments, so we can use the concentration of quadratic forms, Lemma 7.6, to 


conclude A — E 


A 


A 


, l s 0. It remains to get the limit of E 
To do this we use a derivative trick; this technique is similar to those em¬ 
ployed in a similar context by El Karoui and Kosters (2011); Rubio et al. (2012); 
Zhang et al. (2013). We will construct a function with two properties: (1) its 


derivative is the quantity E 


A 


that we want, and (2) its limit is convenient to 


obtain. Based on these two properties, Vitali’s theorem will allow us to obtain 


the limit of E 


A 


Accordingly, consider two general pxp positive definite matrices D , E and 
introduce the function f p (X;D,E) = Xtr(D(E+XI pxp )^ 1 ^j . Note that the 

derivative of / with respect to A is //(A; D, E) = — A tr (D (E + A/ pxp ) -2 ^ . 

This suggests that the function we should use in the derivative trick is 
f p { A; E, E c ); indeed the limit we want to compute is E A = — E //(A; E, E c ) 
Conveniently, the limit of f p is known by the Ledoit-Peche result (5): 


fp{ A; E, E c ) — > a .s. k( A) 


1 


1 


7 \Xv(—X) 


- 1 


for all A £ S := {u + iv : v ^ 0, or v = 0, u > 0}. 

Next we check the conditions for applying Vitali’s theorem, Lemma 7.8. 
By inspection, the function / p (A; E, E c ) is an analytic function of A on S with 
derivative 


dn 

dX 


«'(A) 


v — Xv' 
'y(Xv) 2 ' 


41 




















Furthermore, f p ( A;E,E C ) is bounded in absolute value: |/ P (A;E,E C )| < 
Mk < B 

\ — A ■ 

This shows that f p ( A; E, E) is a bounded sequence. Therefore, we can apply 
Lemma 7.8 to the sequence of analytic functions f p on the domain S, on the set 
of full measure on which f p converges, to obtain that the derivatives also con¬ 
verge on S: f p (X ; E, E) —t a . s . k'(X). Finally, since the functions f p are bounded, 
the dominated convergence theorem implies that E/ p (A;E,E) —► k'(X). Since 


E 


A 


= -E 


fp(X; E, E c ) 


this shows that E 


A 


—a 2 n'(A), as required. 


Finally, to see that the limit of E A > 0, we use that by assumption 


the eigenvalues of E are lower bounded by some b > 0, and those of E c = 
n _ 1 E 1 / 2 y T Pl/E 1 / 2 are upper bounded by B n~ 1 \\V r V\\ 2 1 hence 




a 2 

( \~ 2 X 

E 

A 

= —E 

V 

tr f E (^E c + A Ipxpj J 


> 


a 2 b 


Bn- 1 ||V t V|| 2 ' 


Since V has i.i.d. standard normal entries, it is well known that the operator 
norm of n - 1 || 1 F T f / ||2 is bounded above almost surely (see Theorem 5.8 in (Bai 
and Silverstein, 2010), which requires only a fourth moment on the entries of 


V). Therefore, liminf p E 


A 


> c > 0 for some fixed constant c > 0 . 


a 


7.1.5 Proof of Lemma 7.5 

We want 


to find the limit 

-l 


where M = 


of G = (£-$) M 

^E c + A Ipxp^j E ^E c + A/pxp) ■ By the decomposition (21) we have 5 — 5 = 
n~i/ 2 Ei/ 2 Z, so C = ^Z t Y}/ 2 MY}/ 2 Z. Since Z ~ Af(0,I p ) independently of 
E c (and thus of M), we can write E C = A tr (E 1 / 2 ME 1 / 2 ) = ^ tr (Q 2 ), where 

Q\ •= E ^E c + A I^j . By the concentration lemma 7.6, C — E C — > a .s. 0, so 

all that remains is to find the limit of Xo = p~ x tr ( Q 2 ). 

Lemma 1 in Hachem et al. (2008) states that, under the assumptions of 
Theorem 3.1, one has Var [xo] = 0(l/n 2 ), and also E [xo] = L + 0(l/n 2 ) for a 
certain deterministic quantity L. These results imply that xq —¥ a . s . L. However, 
the form of the limit L given in that paper is not convenient for our purposes. 

A convenient form for the limit is obtained in Chen et al. (2011). Their 
convergence result holds in probability, which is weaker than what we need; 
this explains why we also need the results of Hachem et al. (2008). Chen et al. 
(2011) consider sequences of problems of the form Vi ~ud A f{ii p , E p ), where E p 
is a sequence of covariance matrices that obeys the same conditions we assumed 
in the statement of Theorem 3.1, and the fi p are arbitrary fixed vectors. They 


form the sample covariance matrix S n = (n — 1) 1 — v )( v i ~ v ) 1 , where 
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v = n 1 i v i- addition to the above conditions, they also assume the 
additional condition y/n\p/n — y| — t 0. Then their result states: 

Lemma 7.9 (Lemma 2 of Chen et al. (2011), pp 1357). Under the above con¬ 
ditions, we have the convergence in probability 


1 

- tr 
P 


E p {S n + A I) 



—t P ©2(A,7), 


where ©2 (A, 7 ) is defined in the statement of Theorem 1 of Chen et al. (2011), 
on pp 134 8 

1 - Am(-A) _ m(-A) - Xm'(-X) 

2 ’ (1 — 7 + 7 Am(—A )) 3 (1 — 7 + 7 A?tt(—A )) 4 


Lemma 7.9 is stated for the usual centered covariance matrix S n . In Lemma 
7.5, the covariance matrix of interest is E c = {n — 2) _ 1 E 1 / 2 C T PCE 1 / 2 , where 
P is the projection matrix P = I pxp — 2n _ 1 (e+ie!j ! 1 + e-idjJ. Similarly to 
what we already argued several times in this paper (e.g. in Lemma 7.7), the 
two covariance matrices have identical limit ESD. Therefore Lemma 7.9 applies 
to our setting. Combining this with Lemma 1 in Hachem et al. (2008), we have 
the convergence: 


1 

-tr 

P 



—ta.s. ©2 (A, 7). 


Next, we notice by the definition of v in (4) that 1 — 7 + A) = Xv(X), 

as well as 1 — Xm(— A) = 7 -1 (l — Xv(— A)); and by taking derivatives m(—A) — 
Xm'(-X) = 7 _ 1 (u(—A) — Xv'(-X)). We rewrite the limit ©2 in terms of v: 

1 — Xv v — Xv' v' — v 2 

~ 2 ’ ^ 'j(Xv) 3 'y(Xi)) 4 r )X 2 v A 

Therefore, the limit of C equals 762 , as stated in Lemma 7.5. 


7.1.6 Proof of Lemma 7.6 


We will use the following Trace Lemma quoted from Bai and Silverstein (2010). 

Lemma 7.10 (Trace Lemma, Lemma B.26 of Bai and Silverstein (2010)). Let 
y be a p-dimensional random vector of i.i.d. elements with mean 0. Suppose 
that E \yf\ = 1, and let A p be a fixed p x p matrix. Then 


E [| y T A p y - tvA p \«] < C q {(E [y*\ tr[A p A],}) 


\<l/2 


-E 


2 q 

Vl 


tv[(A p Ay/ 2 ]} 


for some constant C q that only depends on q. 
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Under the conditions of Lemma 7.6, the operator norms 1111 2 are bounded 
by a constant C, thus tr[(A p Aj) 9 / 2 ] < pC q and tr[A p Aj] < pC 2 . Consider 
now a random vector x with the properties assumed in the present lemma. For 


y = y/pxja and q = 2+77/2, using that E 



< C and the other the conditions 


in Lemma 7.6, Lemma 7.10 thus yields 


j2q 


E 


| x T A p x 


2 

— tr A p \ q 
P 




or equivalently E 


A p x - ^ tr A p \ 2+rj < C'p _ ( 1+J) / 4 ). Since this bound 
is summable in p , Markov’s inequality applied to the 2 + 77 -th moment of 
e p = x T A p x — ^ tr A p and the Borel-Cantelli lemma yield the almost sure 
convergence e p — > a .s. 0 . 


7.2 Proof of Theorem 3.2: Unequal sampling 

This proof is very similar to that of Theorem 3.1, so we will omit some details. 
Suppose that the classes have unequal probabilities P {jji = +1) = 7 r+, P(yi = 
— 1) = 7r_, and the conditional model x\y ~ J\f E) holds. The Bayes optimal 
classifier is sign(/(a;)), where f(x) = (x — {y+\ + 7 i_i)/ 2 ) T E _1 (/r + i — p_i) + 
log(7r + /7r_). Using the same notation as before, f(x) equals f(x) = £ T E -1 <5 — 
/2 T E _1 (5 + log(7r_(_/7r_). The error rate of a generic linear classifier sign(a: T ui-|-6) 
conditional on w, b equals (16). 

We observe n+i samples with label yi = 1, and n_i samples with label 
2 /i = —1. We consider a general regularized classifier sign(/^(a:)), where f\(x) = 
x T w\ + b, and w\ = (E c + AJ pxp ) _1 i5, b = -/t T w\ + c. 

As usual, we have S = (/t+i — /t_i)/2, jl = (/t + i + /i_i)/2 and jl±\ = 
'Yh{i-y i = z ti} x i/ n ±i- The centered sample covariance matrix E c retains its origi¬ 
nal definition. 

We evaluate the limits of the linear and quadratic forms in the error rate, 
arising when we replace w and b by w and b, respectively. We assume that 
the same regularity conditions as in Theorem 3.1 hold, with the additional 
requirement that the ratios p/n±i each converge to positive constants: p/n± 1 —> 
7 - 1-1 > 0. For two independent 75 -dimensional standard normal random variables 
Z± 1 , which are also independent of S, p,, we have the stochastic representation 


A = p 



5 = 6 + ^S 1 / 2 




The limit of b = —pJwx + c: To evaluate this limit, we expand the inner 
product —pAwx using the stochastic representation of 5. As in the proof of 
Theorem 3.1, most terms in the expansion tend to 0 due to independence. 
Denote by A n ss B n that two random variables are asymptotically almost surely 
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equivalent, i.e. \A n — B n | —¥ a . a . 0. We have by arguments similar to those in 
Theorem 3.1 that 

-fi T w x » E—E 1 / 2 ^ + A/pxp)" 1 ^ 1 / 2 ^.! - E 1 / 2 (S c + A/p Xp ) _1 E 1 / 2 Z 1 



The limit of We write fjjwx = (n + <5) T (E C + XI pxp ) 1 S, and 

[ijw x « (5 T (E C + A/pxp) -1 ^ ~ ot 2 m{— A). Similarly « — ot 2 m(— A). 

The limit of wjE(h>: On expanding this expression using the stochastic 
representation, the cross-terms vanish asymptotically. Denoting M = (E c + 
A/pxp)~ 1 S(E c + A/pxp) _1 , we obtain 

wTEw a « < 5 t E 1 ' /2 ME 1 / 2 5 + —ZliMZ_i + -J-ZjMZi 

4n_i Arii 

2 u — Ac' 7 _i + 7 + i if — u 2 
a 'y(Xv) 2 4 A 2 u 4 

Let us denote by Q the limit obtained. 

Putting everything together: Using the above formulas, we see that as 
claimed in Eq. 11, Erro ( w\,b ) —*- a . s . U: 


U = 7r_<f> 


—a 2 m(— A) + 7 x 4 7+1 k(A) + c\ / a 2 m(— A) + 4 

— 7 = I +7T+$ I 


k(A) + c 


7.3 Proof of Corollary 3.3 

From Theorem 3.1, the limit error rate is $(—0), where 0 = 

a 2 m (-A)/i/a 2 r(A) + 7 g(A), and 


r(A) = lim E- tr ( E (s c + A i) } ; g(A) 

p—>oo p \ \ J ) 


= lim E- tr 

p—> oo p 


(e c + a/)" 


E 


Since E = /p X p, we have r(A) = q( A), and both are equal to the limit of 
-2N “ 


tr 


(e c + a/)' 


Analogously to the proof of Lemma 7.5, this limit 

mA-X; 7) 


equals m' T (— A; 7 ). Therefore 0 simplifies to /' , ,, 

V a +7 yJm'A-\-,i) 

The quantity m/(—A; 7 ) can be expressed in terms of m; by differentiating 
the Marchenko-Pastur equation: mi{z\ 7 ) = 1/(1 — z — 7 — "fzmi(z; 7 )). We get 
m! = m 2 (l + 7 m)/(l — 7 zm 2 ), which leads to the claimed expression for 0. For 
7 = 1 we get the required formula from Eq. (7) after some calculations. 
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7.4 Note on the Ledoit-Peche result ( 5 ) 

Ledoit and Peche (2011) prove (5) in their Lemma 2. Our notation differs from 
theirs: 7 here is equal to their 7 -1 (because the role of n,p is reversed); £ here 
is Sn, and —A here corresponds to z. Their limit is given in terms of m, but 
simplifies to 


k(A) 


1 — Ato(—A) 

1 — 7(1 — Am(—A)) 


1 

7 


(m-a) 



Their theorem requires a finite 12th moment on £ _ 1 /, 2 Xj, which holds in our 
case. Their results are stated for uncentered covariance matrices, but it is well 
known that centering does not affect limit of the eigenvalue distribution of £, so 
the result still holds (see, e.g., p. 39 of Bai and Silverstein, 2010). Furthermore, 
the normalization factor 1/n in front of the covariance matrix can be replaced 
by l/(n — 2), as needed by Regularized Discriminant Analysis. This follows by 
standard perturbation inequalities in the Frobenius norm (see Corollary A.42 in 
Bai and Silverstein, 2010). The calculation is fleshed out in detail in the proof 
of Lemma 7.5. Finally, their convergence is stated for z € C + , but by standard 
arguments it can be extended to real z < 0 . 


7.5 Proof of Theorem 3.5 

We will use the representation of the margin from theorem 3.1. To evaluate 
the necessary limits, it is helpful to represent the Stieltjes transforms and their 
derivatives as expectations with respect to the ESD. Thus, let Y be a random 
variable distributed according to the ESD F, and let 7 be a random variable 
distribued according to the companion ESD F. Then m, v are the Stieltjes 
transforms of Y, Y_, respectively. Hence 


i(—A) =E 


1 


F + A 


i'(-A) = E 


(Y + A ) 2 


m(—X)—Xm'(—X) = E 


Y 


(Y + A ) 2 
(27) 

Further Xv(—X) = 1 + 7 (Am(—A) — 1), so Au(—A) = 1 — 7 E . 

The error rate of LDA: As explained in Section 6.3, for 7 < 1, Y is sup¬ 
ported on a compact set bounded away from 0, so 7 > c > 0 for some c. This 
will allow us to take the limits as A f 0 inside the expectation, using the domi¬ 
nated convergence theorem; we will not repeat this fact. Using equation (27), we 
have lim AJ . 0 to(-A) = E [F _1 ], lim A 4 . 0 [w(-A) - AU(-A)] 7 -1 = lim A4 . 0 m(-A) - 
Xm'{— A) = E [F- 1 ], and by the formula for Xv, lim A ^o Xv(—X) = 1 — 7 (which 
was already shown in Section 6.3). 

Differentiating the formula for the companion Stieltjes transform, we see 
A 2 z/(—A) = 1 + 7 (A 2 m'(—A) — l). Hence, 


v'(-X) 1 +7 (A 2 m'(-A) - l) lim A |o [l+7 (A 2 to'(—A) — l)] 1 

a™ v 2 {-X) _ Apo A 2 u 2 (—A) ~ (I- 7) 2 “ 1 ^ 7 ’ 
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A 

(1-+A) 2 


= 0. We con- 


where we have used that lim^o X 2 m'(— A) = liniA|o E 

elude that lim^o v'(—X)/v 2 (—X) — 1 = 7/(1 — 7 ). Putting everything together, 
we find 


©lda = lim ■ 


9 

a r 


(1-7) 


7 

1-7 


Ato ^/a 2 ri + 6 ^ a 2 E [y-i] + _ 

Let T be a random variable distributed according to the PSD H. By taking 
the limit as z —> 0, z £ C + in the Marchenko-Pastur equation below, (the 
validity of the limit is justified by arguments similar to those in Section 6.3), 

dH(t) 


* (2)= L 


/i =0 U 1 -'y-'yzm {z))-z' 
we find m( 0 ) = f t ^_^ dH(t), or equivalently E [y _1 ] = E [X 1 ” 1 ] /(I — 7 ). 
This leads to the formula for ©lda- 


The error rate of IR: We have Xm(—X) = E 


Y +A 


. As explained in 


Section 6.3, Y is a bounded random variable, so lim^oo Xm{— A) = 1. Similarly 
limv-^oo Xv(—X) = 1. Next, we note 


lim X 2 [m(—A) — Xm! A)] = lim E 


A —>00 


A—>-oo 


A 


y + A 


y 


= E [Y]. 


Finally, we evaluate the limit of A 2 ( 72 ^ 7 ^ — 1^ • Noting that Xv tends to 
1, it is enough to find the limit of A 4 (t/(—A) — v 2 {— A)). We compute 


X 2 {v'(-X)^v 2 (-X))=E 


A 

y + A 


-E 


A 

y + A 


E 


1 - 


y 


— 1 — E 


y 


-1 \ 2 


y + A 


y + A 

Therefore, 

lim A 4 (r/(—A) — f 2 (—A)) = lim < E 

A—>-oo A—>-oo I 


= E 


y 


y + A 


-E 


y 


1 2 


y 2 


A 


y + A 


-E 


Y- 


y + A 


A 


y + A 


= e [y 2 ] - e [yf. 


Using the relationship F = 7 F + (1 — 7 )<So, we can write E [Y] = 7 E [y] and 
E [y 2 ] = 7 E [y 2 ] . Putting everything together, we find 


©ir = lim 


a 2 rA 


A ^°° \/a 2 ??A 2 + 9X 2 ^ a 2 E [yj + 7(e [y 2 ] _ 7E [y] 2 ) 

Finally, it is known that E [y] = E [T], E [y 2 ] = E [T 2 ] + 7 E [T] 2 (see, e.g., 
Lemma 2.16 in Yao et al. (2015)). This leads to the claimed formula. 
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7.6 Proof of Corollary 3.6 

From Theorem 3.5, minimizing ©lda is equivalent to minimizing E# [T _1 ] for 
H £ T-L{k\,k 2 ). By Jensen’s inequality, E^rp 1 ” 1 ] > 1/E H [T\ = 1; with equality 
if H = Si. This shows the first claim. 

Again by Theorem 3.5, minimizing 0ir over H £ 'H{k\ ) k 2 ) amounts to 
maximizing E# [T 2 ] over that class. For this, note that k\ <T < k 2 for a random 
variable T distributed according to H £ T-L{k\, k 2 ). Therefore (T — k\){T — k 2 ) < 
0, and taking expectations we get the upper bound 


E h[T 2 } < (fci + k 2 )E H [T] — k\k 2 = k\ + k 2 — k\k 2 . 

This upper bound is achieved for any H = 'W\Sk 1 + w 2 6k 2 ■ It is now easy to 
check that there exists a unique set of weights wt such that a distribution of the 
above form has unit mean, so that it belongs to 'H(ki,k 2 ); and those are the 
weights given in the corollary. 
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